diff --git a/cs3223/cheatsheet.pdf b/cs3223/cheatsheet.pdf
index 4ee8392..668ae8f 100644
Binary files a/cs3223/cheatsheet.pdf and b/cs3223/cheatsheet.pdf differ
diff --git a/cs3223/cheatsheet.typ b/cs3223/cheatsheet.typ
index 40576ee..ea1d79b 100644
--- a/cs3223/cheatsheet.typ
+++ b/cs3223/cheatsheet.typ
@@ -9,10 +9,10 @@
 - Each track is broken up into sectors
 - Cylinder is the same tracks across all surfaces
 - Block comprises of multiple sectors
-- *Disk Access Time* - $"Seek time" + "Rotational Delay" + "Transfer Time"$
-  - *Seek Time* - Move arms to position disk head
-  - *Rotational Delay* - $1/2 60/"RPM"$
-  - *Transfer time*(for n sectors) - $n times "time for 1 revolution"/ "sectors per track"$
+- *Disk Access Time*: $"Seek time" + "Rotational Delay" + "Transfer Time"$
+  - *Seek Time*: Move arms to position disk head
+  - *Rotational Delay*: $1/2 60/"RPM"$
+  - *Transfer time* (for $n$ sectors): $n times "time for 1 revolution" / "sectors per track"$
   - $n$ is requested sectors on track
 - Access Order

@@ -26,9 +26,9 @@
 - Each frame maintains pin count(PC) and dirty flag
 === Replacement Policies
 - Decide which unpinned page to replace
-- *LRU* - queue of pointers to frames with PC = 0
-- *clock* - LRU variant
-  - *Reference bit* - turns on when PC = 0
+- *LRU*: queue of pointers to frames with PC = 0
+- *clock*: LRU variant
+  - *Reference bit*: turns on when PC = 0
   - Replace a page when ref bit off and PC = 0
 #image("clock-replacement-policy.png")

@@ -61,7 +61,7 @@
 - *Composite search key* if $k > 1$
 - *unique key* if search key contains _candidate_ key of table
 - index is stored as file
-- *Clustered index* - Ordering of data is same as data entries
+- *Clustered index*: Ordering of data is same as data entries
   - key is known as *clustering key*
 - Format 1 index is clustered index (Assume format 2 and 3 to be unclustered)

@@ -86,8 +86,8 @@
 === Insertion
 + *Leaf node Overflow*
   - Redistribute and then split
-  - *Split* - Create a new leaf $N$ with $d+1$ entries.
-    Create a new index entry $(k, square.filled)$ where $k$ is smallest key in $N$
-  - *Redistribute* - If sibling is not full, take from it. If given right, update right's parent pointer, else current node's parent pointer
+  - *Split*: Create a new leaf $N$ with $d+1$ entries. Create a new index entry $(k, square.filled)$ where $k$ is smallest key in $N$
+  - *Redistribute*: If sibling is not full, take from it. If given right, update right's parent pointer, else current node's parent pointer
 + *Internal node Overflow*
   - Node has $2d+1$ keys.
   - Push middle $(d+1)$-th key up to parent.
@@ -156,3 +156,95 @@
 - Collisions: If they have same hashed value.
 - Need overflow pages if collisions exceed page capacity

+#colbreak()
+= Sorting
+== Notation
+#table(
+  columns: (auto, auto),
+  $|R|$, [no. of pages in $R$],
+  $||R||$, [no. of tuples in $R$],
+  $pi_L (R)$, [project columns by list $L$ from $R$],
+  $pi_L^* (R)$, [project with duplicates],
+)
+== External Merge Sort
+- *File size*: $N$ pages
+- Memory pages available: $B$
+- *Pass 0*: Create sorted runs
+  - Read and sort $B$ pages at a time
+- *Pass $i$*: Use $B-1$ pages for input, 1 for output, performing a $(B-1)$-way merge
+- *Analysis*
+  - Sorted runs: $N_0 = ceil(N/B)$
+  - Total passes: $ceil(log_(B-1) (N_0))+1$
+  - Total I/O: $2 N (ceil(log_(B-1) (N_0))+1)$
+=== Optimized Merge Sort
+- Read and write in blocks of $b$ pages
+  - Allocate 1 block for output
+  - Remaining memory for input: $floor(B/b)-1$ blocks
+- *Analysis*
+  - Sorted runs: $N_0 = ceil(N/B)$
+  - Runs merged at each pass: $F = floor(B/b)-1$
+  - No. of merge passes: $ceil(log_F (N_0))$ (+1 for total passes)
+  - Total I/O: $2 N (ceil(log_F (N_0))+1)$
+- *Sorting with B+ Trees*: I/O cost: $h$ + scan of leaf pages + heap access (if not a covering index)
+
+== Projection
+=== Sort based approach
+- Extract attributes, sort attributes, remove duplicates
+- *Analysis*
+  + Extract attributes: $|R| "(scan)" + |pi_L^*(R)| "(output)"$
+  + Sort attributes:
+    - $N_0 = ceil((|pi_L^*(R)|)/B)$
+    - Merging passes: $ceil(log_(B-1) (N_0))$
+    - Total I/O: $2 |pi_L^*(R)| (ceil(log_(B-1) (N_0))+1)$
+  + Remove duplicates: $|pi_L^*(R)|$
+=== Optimized approach
+- Split step 2 into creating and merging sorted runs; fold run creation into step 1 and duplicate removal into the final merge (step 3)
+- *Analysis*
+  - *Step 1*
+    - $B-1$ pages for initial sorted runs
+    - Sorted runs: $N_0 = ceil((|pi^*_L (R)|) / (B-1))$
+    - Cost to create sorted runs: $|R| + |pi^*_L (R)|$
+  - *Step 2*
+    - Merging passes: $ceil(log_(B-1) (N_0))$
+    - Cost of merging: $2 |pi^*_L (R)| ceil(log_(B-1) (N_0))$
+    - Cost of merging, excluding the final output I/O: $(2 ceil(log_(B-1) (N_0))-1) |pi^*_L (R)|$
+
+=== Hash based approach
+- *Partitioning*
+  - Allocate 1 page for input, $B-1$ pages for output.
+  - Read 1 page at a time; for each tuple, create its projection and hash it ($h$) to distribute among the $B-1$ buffers
+  - Flush a buffer to disk when full.
+- *Duplicate Elimination*
+  - For each partition $R_i$, create a hash table: hash each tuple $t$ with hash function $h' != h$ to bucket $B_j$, inserting $t$ only if $t in.not B_j$
+- *Partition Overflow*: hash table for $pi^*_L (R_i)$ is larger than the memory pages allocated
+- *Analysis*
+  - I/O cost (no partition overflow): $|R| + 2|pi^*_L (R)|$
+    - Partitioning phase: $|R| + |pi^*_L (R)|$
+    - Duplicate elimination: $|pi^*_L (R)|$
+  - To avoid partition overflows ($f$ = fudge factor for hash table size):
+    - $|R_i| = (|pi^*_L (R)|) / (B-1)$
+    - Need $B > "size of hash table" = f times |R_i|$
+    - i.e. $B > sqrt(f times |pi^*_L (R)|)$ (approximately)
+
+= Selection
+- *Conjunct*: $>= 1$ terms connected by $or$
+- *CNF predicate*: $>= 1$ conjuncts connected by $and$
+- *Covered Conjunct*: conjunct $p_i$ is covered if each attribute in $p_i$ is in the key $K$ or include columns of index $I$
+  - $sigma_p (R), p = ("age" > 5) and ("height" = 180) and ("level" = 3), I_1 "key" = ("level", "weight", "height")$
+  - $p_c = ("height" = 180) and ("level" = 3)$
+- *Primary Conjunct*: conjuncts that the index matches (e.g. a prefix of a B+ tree key); they determine the range of the index scan
+- $sigma_p (R)$: Select rows from $R$ that satisfy predicate $p$
+- Access Path: way of accessing data records / entries
+  - *Table Scan*: Scan all data pages (Cost: $|R|$)
+  - *Index Scan*: Scan index pages
+  - *Index Combination*: Combine results from multiple index scans
+  - Scan/combination can be followed by RID lookups to retrieve data
+- *Index only plan*: plan that does not access any data tuples in $R$
+- *Covering Index*: $I$ is a covering index if every attribute of $R$ used in the query is part of the key / include columns of $I$
+
+== B+ Trees
+- For Index Scan + RID Lookup, many matching RIDs could refer to the same page
+  - Sort matching RIDs before performing lookups: avoids retrieving the same page repeatedly
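
The disk access time formula at the top of the cheatsheet can be sanity-checked with a short script (a sketch; the function name and parameter choices are illustrative, with all times in milliseconds):

```python
def access_time_ms(seek_ms: float, rpm: int, n_sectors: int, sectors_per_track: int) -> float:
    """Disk access time = seek time + average rotational delay + transfer time."""
    rev_ms = 60_000 / rpm                 # one revolution: 60/RPM seconds, in ms
    rotational_delay = rev_ms / 2         # average delay: half a revolution
    transfer = n_sectors * rev_ms / sectors_per_track  # n sectors on one track
    return seek_ms + rotational_delay + transfer

# 4 ms seek, 6000 RPM disk (10 ms/rev), 10 of 100 sectors on the track:
print(access_time_ms(4, 6000, 10, 100))   # 4 + 5 + 1 = 10.0 ms
```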
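
The clock replacement policy (second-chance sweep over frames with PC = 0, as described in the Replacement Policies section) can be sketched as follows; this is an illustrative in-memory model, not any particular DBMS's buffer-manager API:

```python
class Frame:
    """Buffer frame with the pin count (PC) and reference bit from the notes."""
    def __init__(self, page):
        self.page = page
        self.pin_count = 0
        self.ref_bit = True   # turned on when pin count drops to 0

def clock_victim(frames, hand):
    """Return (victim_index, next_hand_position).

    Sweeps the clock hand, skipping pinned frames, clearing reference bits
    as it passes; a frame with PC == 0 and ref bit off is the victim.
    (Loops forever if every frame is pinned - a real buffer manager
    would raise an error instead.)
    """
    n = len(frames)
    while True:
        f = frames[hand]
        if f.pin_count == 0:
            if f.ref_bit:
                f.ref_bit = False            # give a second chance
            else:
                return hand, (hand + 1) % n  # ref bit off and PC == 0
        hand = (hand + 1) % n
```

With frames `a`, `b`, `c` where `b` is pinned and all reference bits start on, the first sweep clears the bits of `a` and `c`, then the hand returns to `a` and evicts it.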
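
The external merge sort analysis (sorted runs, merge passes, total I/O, including the blocked-I/O variant) can be checked numerically; a minimal sketch, where `N`, `B`, and `b` are the page, buffer, and block counts from the Sorting section and `b=1` recovers the basic algorithm:

```python
import math

def merge_sort_io(N: int, B: int, b: int = 1) -> int:
    """Total I/O for external merge sort of N pages with B buffer pages,
    reading/writing in blocks of b pages."""
    N0 = math.ceil(N / B)                      # sorted runs after pass 0
    F = B // b - 1                             # fan-in: runs merged per pass
    passes = math.ceil(math.log(N0, F)) + 1    # merge passes + pass 0
    return 2 * N * passes                      # each pass reads and writes N pages

# 1000-page file, 11 buffer pages: N0 = 91, fan-in 10, so 2 merge passes.
print(merge_sort_io(1000, 11))     # 2 * 1000 * 3 = 6000
```

Note the trade-off the formulas capture: `b = 2` halves the fan-in to 4, costing an extra pass here (10000 I/Os) in exchange for cheaper sequential I/O per page.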
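
The two phases of hash-based projection can be sketched in memory (partition projected tuples with `h`, then eliminate duplicates per partition with a second hash `h'`); Python's built-in `hash` and `set` stand in for the two hash functions, and the partition count is illustrative rather than `B-1`:

```python
def hash_project(tuples, attrs, n_parts: int = 4):
    """Hash-based projection with duplicate elimination.

    Phase 1 (partitioning): project each tuple onto attrs and distribute
    it to one of n_parts partitions by hash value.
    Phase 2 (duplicate elimination): within each partition, insert each
    projected tuple into a hash table only if it is not already present.
    """
    parts = [[] for _ in range(n_parts)]
    for t in tuples:
        proj = tuple(t[a] for a in attrs)
        parts[hash(proj) % n_parts].append(proj)

    result = []
    for part in parts:
        seen = set()                  # per-partition hash table (h')
        for proj in part:
            if proj not in seen:      # insert t only if t is not in bucket
                seen.add(proj)
                result.append(proj)
    return result
```

Duplicates of a projected tuple always hash to the same partition under `h`, which is why deduplicating each partition independently suffices.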
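
The final point, sorting matching RIDs before the RID lookup so each data page is fetched once, can be sketched as below; `read_page` is a hypothetical page-fetch callback and RIDs are modelled as `(page_id, slot)` pairs:

```python
def fetch_with_sorted_rids(rids, read_page):
    """Fetch tuples for (page_id, slot) RIDs, sorting by page first
    so that each distinct page is read exactly once."""
    results = []
    cached_page, cached_id = None, None
    for page_id, slot in sorted(rids):      # groups RIDs by page
        if page_id != cached_id:
            cached_page = read_page(page_id)  # one I/O per distinct page
            cached_id = page_id
        results.append(cached_page[slot])
    return results
```

Without the sort, the interleaved RID sequence `(1,0), (2,1), (1,1), (2,0)` would fetch each page twice; sorted, it costs two page reads.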