current progress

2025-10-05 17:11:04 +08:00
parent a4b1c9b357
commit 645dea80aa
2 changed files with 102 additions and 10 deletions

- Each track is broken up into sectors
- A cylinder is the set of tracks at the same position across all surfaces
- A block consists of multiple sectors
- *Disk Access Time*: $"Seek time" + "Rotational Delay" + "Transfer Time"$
- *Seek Time*: Move arms to position disk head
- *Rotational Delay*: $1/2 60/"RPM"$
- *Transfer time* (for $n$ sectors): $n times "time for 1 revolution" / "sectors per track"$
- $n$ is the number of requested sectors on the track
- Access Order
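The access-time formula above can be checked numerically; the drive parameters below (5 ms seek, 10,000 RPM, 100 sectors per track) are hypothetical.

```python
# Hypothetical drive: 5 ms seek, 10,000 RPM, 100 sectors per track.
def disk_access_time_ms(seek_ms, rpm, sectors_per_track, n):
    revolution_ms = 60_000 / rpm          # one full revolution, in ms
    rotational_delay = revolution_ms / 2  # expect half a revolution on average
    transfer = n * revolution_ms / sectors_per_track  # n sectors on one track
    return seek_ms + rotational_delay + transfer

print(disk_access_time_ms(5, 10_000, 100, 10))  # 5 + 3 + 0.6 = 8.6 ms
```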
- Each frame maintains a pin count (PC) and a dirty flag
=== Replacement Policies
- Decide which unpinned page to replace
- *LRU*: queue of pointers to frames with PC = 0
- *clock*: LRU variant
- *Reference bit*: turns on when PC = 0
- Replace a page when ref bit off and PC = 0
#image("clock-replacement-policy.png")
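A minimal sketch of the clock policy; the frame fields `pc`/`ref` and the helper name `clock_victim` are illustrative, not from the notes.

```python
def clock_victim(frames, hand):
    """Advance the clock hand; return (victim_index, new_hand), or (None, hand)."""
    n = len(frames)
    for _ in range(2 * n):          # two sweeps suffice; all pinned -> no victim
        f = frames[hand]
        if f["pc"] == 0:            # only unpinned frames are candidates
            if f["ref"]:
                f["ref"] = False    # second chance: clear ref bit, move on
            else:
                return hand, (hand + 1) % n
        hand = (hand + 1) % n
    return None, hand
```

With frames `[pinned, unpinned+ref, unpinned]`, the hand skips the pinned frame, clears the second frame's reference bit, and evicts the third.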
- *Composite search key* if $k > 1$
- *unique key* if search key contains _candidate_ key of table
- Index is stored as a file
- *Clustered index*: Ordering of data records is the same as the ordering of data entries
- key is known as *clustering key*
- Format 1 index is clustered index (Assume format 2 and 3 to be unclustered)
=== Insertion
+ *Leaf node Overflow*
- Try to redistribute with a sibling first; split if redistribution is not possible
- *Split*: Create a new leaf $N$ with $d+1$ entries. Create a new index entry $(k, square.filled)$ where $k$ is smallest key in $N$
- *Redistribute*: If a sibling is not full, take entries from it. If entries are given to the right sibling, update the right sibling's parent entry; else update the current node's parent entry
+ *Internal node Overflow*
- Node has $2d+1$ keys.
- Push middle $(d+1)$-th key up to parent.
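The leaf-split rule above can be sketched for keys only (record pointers omitted); `split_leaf` is an illustrative helper, not from the notes.

```python
def split_leaf(entries, d):
    """Split an overflowing leaf of 2d+1 sorted entries: keep d in the old
    leaf, move d+1 into a new leaf N, and copy N's smallest key up as the
    key of the new index entry."""
    assert len(entries) == 2 * d + 1
    old, new = entries[:d], entries[d:]
    return old, new, new[0]

print(split_leaf([5, 10, 15, 20, 25], d=2))  # ([5, 10], [15, 20, 25], 15)
```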
- Collisions: Two keys hash to the same value.
- Need overflow pages if collisions exceed page capacity
#colbreak()
= Sorting
== Notation
#table(
columns: (auto, auto),
$|R|$, [number of pages in $R$],
$||R||$, [number of tuples in $R$],
$pi_L (R)$, [project columns in list $L$ from $R$ (duplicates removed)],
$pi_L^* (R)$, [project with duplicates retained],
)
== External Merge Sort
- *File size*: $N$ pages
- Memory pages available: $B$
- *Pass 0*: Create sorted runs
- Read and sort $B$ pages at a time
- *Pass i*: Use $B-1$ pages for input, 1 for output, performing a $(B-1)$-way merge
- *Analysis*
- Sorted runs: $N_0 = ceil(N/B)$
- Total passes: $ceil(log_(B-1) (N_0))+1$
- Total I/O: $2 N (ceil(log_(B-1) (N_0))+1)$
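The analysis above, as a small calculator; the inputs ($N = 108$ pages, $B = 5$ buffers) are hypothetical.

```python
from math import ceil

def ems_cost(N, B):
    """External merge sort: (sorted runs, total passes, total I/O in pages)."""
    n0 = ceil(N / B)                  # runs after pass 0
    runs, merge_passes = n0, 0
    while runs > 1:                   # each pass is a (B-1)-way merge
        runs = ceil(runs / (B - 1))
        merge_passes += 1
    passes = merge_passes + 1         # + pass 0
    return n0, passes, 2 * N * passes # every pass reads and writes N pages

print(ems_cost(108, 5))  # (22, 4, 864)
```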
=== Optimized Merge Sort
- Read and write in blocks of $b$ pages
- Allocate 1 Block for output
- Remaining memory for input: $floor(B/b)-1$ blocks
- *Analysis*
- sorted runs: $N_0 = ceil(N/B)$
- Runs merged at each pass (fan-in): $F = floor(B/b)-1$
- Number of merge passes: $ceil(log_F (N_0))$ ($+1$ for pass 0)
- Total IO: $2 N (ceil(log_F (N_0))+1)$
- *Sorting with B+ Trees*: IO cost: $h$ (root-to-leaf traversal) + scan of leaf pages + heap access (if not a covering index)
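The blocked-I/O analysis as a calculator; the inputs ($N = 1000$, $B = 30$, $b = 3$) are hypothetical.

```python
from math import ceil

def blocked_ems_cost(N, B, b):
    """Optimized merge sort with b-page blocks: fan-in F = floor(B/b) - 1."""
    n0 = ceil(N / B)                  # pass 0 still sorts B pages at a time
    F = B // b - 1                    # one block reserved for output
    runs, merge_passes = n0, 0
    while runs > 1:
        runs = ceil(runs / F)
        merge_passes += 1
    return 2 * N * (merge_passes + 1)

print(blocked_ems_cost(1000, 30, 3))  # 34 runs, 2 merge passes -> 6000
```

With $b = 1$ this reduces to the plain analysis above.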
== Projection
=== Sort based approach
- Extract attributes, Sort attributes, remove duplicates
- *Analysis*
+ Extract Attributes: $|R|"(scan)" + |pi_L^*(R)| "(output)"$
+ Sort Attributes:
- $N_0 = ceil((|pi_L^*(R)|)/B)$
- Merging Passes: $ceil(log_(B-1) (N_0))$
- Total IO: $2 |pi_L^*(R)| (ceil(log_(B-1) (N_0))+1)$
+ Remove Duplicates: $|pi_L^*(R)|$
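The three steps combined into one cost function; `T` stands for $|pi^*_L (R)|$ and the inputs ($|R| = 1000$ pages, $T = 250$, $B = 20$) are hypothetical.

```python
from math import ceil

def sort_projection_cost(R_pages, T, B):
    """Naive sort-based projection: extract + full sort of T pages + dedup scan."""
    extract = R_pages + T               # scan R, write projected pages
    runs, merge_passes = ceil(T / B), 0
    while runs > 1:
        runs = ceil(runs / (B - 1))
        merge_passes += 1
    sort = 2 * T * (merge_passes + 1)   # pass 0 + merge passes
    return extract + sort + T           # + duplicate-removal scan

print(sort_projection_cost(1000, 250, 20))  # 1250 + 1000 + 250 = 2500
```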
=== Optimized approach
- Split step 2 into creating and merging sorted runs; fold run creation into step 1 and duplicate removal (step 3) into the final merge
- *Analysis*
- *Step 1*
- $B-1$ pages for initial sorted run
- Sorted Runs: $N_0 = ceil((|pi^*_L (R)|) / (B-1))$
- Create sorted run = $|R| + |pi^*_L (R)|$
- *Step 2*
- Merging passes: $ceil(log_(B-1) (N_0))$
- Cost of merging: $2 |pi^*_L (R)| ceil(log_(B-1) (N_0))$
- Cost of merging excluding the final output IO: $(2 ceil(log_(B-1) (N_0))-1) |pi^*_L (R)| $
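The optimized projection cost, with hypothetical inputs $|R| = 1000$ pages, $|pi^*_L (R)| = 250$ (written `T`), $B = 20$.

```python
from math import ceil

def opt_sort_projection_cost(R_pages, T, B):
    """Optimized sort-based projection: project while creating runs, drop
    duplicates during the final merge (whose output is never written)."""
    step1 = R_pages + T                       # read R, write initial sorted runs
    runs, merge_passes = ceil(T / (B - 1)), 0
    while runs > 1:
        runs = ceil(runs / (B - 1))
        merge_passes += 1
    step2 = max(2 * merge_passes - 1, 0) * T  # final pass skips output I/O
    return step1 + step2

print(opt_sort_projection_cost(1000, 250, 20))  # 1250 + 250 = 1500
```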
=== Hash based approach
- *Partitioning*
- Allocate 1 page for input, $B-1$ pages for output.
- Read 1 page at a time, for each tuple, create projection, hash($h$) to distribute to $B-1$ buffers
- Flush to disk when full.
- *Duplicate Elimination*
- For each partition $R_i$, build an in-memory hash table: hash each tuple $t$ with a second hash function $h' != h$ to bucket $B_j$, inserting $t$ into $B_j$ only if $t in.not B_j$
- *Partition Overflow*: hash table for $pi^*_L (R_i)$ is larger than memory pages allocated for $pi_L (R)$
- *Analysis*
- IO Cost (no partition overflow) : $|R| + 2|pi^*_L (R)|$
- Partitioning Phase: $|R| + |pi^*_L (R)|$
- Duplicate Elimination: $|pi^*_L (R)|$
- To Avoid partition overflows:
- $|R_i| = (|pi^*_L (R)|) / (B-1)$
- $B > "size of hash table" = f times |R_i|$, where $f$ is a fudge factor
- $B > sqrt(f times |pi^*_L (R)|)$
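A toy in-memory version of the two phases; Python's built-in hashing stands in for $h$ (partitioning) and for $h'$ (the per-partition `set`), and `cols` selects the projection list. All names are illustrative.

```python
def hash_project(tuples, cols, n_partitions=4):
    """Phase 1: hash projected tuples into partitions with h.
    Phase 2: per partition, a set (second hash) removes duplicates."""
    parts = [[] for _ in range(n_partitions)]
    for t in tuples:
        proj = tuple(t[c] for c in cols)
        parts[hash(proj) % n_partitions].append(proj)
    out = []
    for part in parts:            # duplicates always share a partition
        out.extend(set(part))
    return sorted(out)

rows = [(1, "a", 9), (2, "a", 9), (3, "b", 7)]
print(hash_project(rows, cols=(1, 2)))  # [('a', 9), ('b', 7)]
```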
= Selection
- *Conjunct*: one or more terms connected by $or$
- *CNF predicate*: one or more conjuncts connected by $and$
- *Covered conjunct*: conjunct $p_i$ is covered by index $I$ if each attribute in $p_i$ appears in the key $K$ or the include columns of $I$
- $sigma_p (R), p = ("age" > 5) and ("height" = 180) and ("level" = 3), I_1 "key" = ("level", "weight", "height")$
- $p_c = ("height" = 180) and ("level" = 3)$
- *Primary Conjunct*: covered conjunct whose attributes can restrict the index scan (e.g. a prefix of the B+ tree key; here $("level" = 3)$)
- $sigma_p (R)$: Select rows from $R$ that satisfy predicate $p$
- Access Path: way of accessing data records / entries
- *Table Scan*: Scan all data pages (Cost: $|R|$)
- *Index Scan*: Scan index pages
- *Index Combination*: Combine from multiple index scans
- Scan/Combination can be followed by RID lookup to retrieve data
- *Index only plan*: plan that does not need to access any data tuples in $R$
- *Covering Index*: $I$ is a covering index if every attribute of $R$ used in the query is part of the key / include columns of $I$
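The covered-conjunct test is just a subset check; `is_covered` is an illustrative helper using the key from the example above.

```python
def is_covered(conjunct_attrs, key, include=()):
    """A conjunct is covered if all its attributes are key or include columns."""
    return set(conjunct_attrs) <= set(key) | set(include)

key = ("level", "weight", "height")
print(is_covered({"height"}, key))  # height = 180 -> covered
print(is_covered({"age"}, key))     # age > 5    -> not covered
```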
== B+ Trees
- For Index Scan + RID Lookup, many matching RIDs could refer to the same page
- Sort matching RIDs before performing lookup: Avoid retrieving same page
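Why sorting helps, as a toy count of page fetches; RIDs are modelled as `(page, slot)` pairs and the helper name is illustrative.

```python
def pages_fetched(rids):
    """Page fetches when RIDs are visited in sorted order: each page once."""
    fetched, last = 0, None
    for page, _slot in sorted(rids):
        if page != last:
            fetched += 1          # new page: one I/O
            last = page
    return fetched

rids = [(7, 1), (3, 0), (7, 2), (3, 5), (9, 0)]
print(pages_fetched(rids))  # 3 distinct pages instead of up to 5 lookups
```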