current progress

2025-10-05 17:11:04 +08:00
parent a4b1c9b357
commit 645dea80aa
2 changed files with 102 additions and 10 deletions

- Each track is broken up into sectors
- A cylinder is the set of tracks at the same position across all surfaces
- A block consists of multiple sectors
- *Disk Access Time*: $"Seek time" + "Rotational Delay" + "Transfer Time"$
- *Seek Time*: Move arms to position disk head
- *Rotational Delay*: $1/2 60/"RPM"$
- *Transfer time* (for $n$ sectors): $n times "time for 1 revolution" / "sectors per track"$
- $n$ is the number of requested sectors on the track
- Access Order
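The access-time formula above can be checked numerically; the drive parameters below (5 ms seek, 10,000 RPM, 100 sectors per track) are hypothetical.

```python
# Hypothetical drive: 5 ms seek, 10,000 RPM, 100 sectors per track.
def disk_access_time_ms(seek_ms, rpm, sectors_per_track, n):
    revolution_ms = 60_000 / rpm          # one full revolution, in ms
    rotational_delay = revolution_ms / 2  # expect half a revolution on average
    transfer = n * revolution_ms / sectors_per_track  # n sectors on one track
    return seek_ms + rotational_delay + transfer

print(disk_access_time_ms(5, 10_000, 100, 10))  # 5 + 3 + 0.6 = 8.6 ms
```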
- Each frame maintains a pin count (PC) and a dirty flag
=== Replacement Policies
- Decide which unpinned page to replace
- *LRU*: queue of pointers to frames with PC = 0
- *clock*: LRU variant
- *Reference bit*: turns on when PC = 0
- Replace a page when ref bit off and PC = 0
#image("clock-replacement-policy.png")
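A minimal sketch of the clock policy; the frame fields `pc`/`ref` and the helper name `clock_victim` are illustrative, not from the notes.

```python
def clock_victim(frames, hand):
    """Advance the clock hand; return (victim_index, new_hand), or (None, hand)."""
    n = len(frames)
    for _ in range(2 * n):          # two sweeps suffice; all pinned -> no victim
        f = frames[hand]
        if f["pc"] == 0:            # only unpinned frames are candidates
            if f["ref"]:
                f["ref"] = False    # second chance: clear ref bit, move on
            else:
                return hand, (hand + 1) % n
        hand = (hand + 1) % n
    return None, hand
```

With frames `[pinned, unpinned+ref, unpinned]`, the hand skips the pinned frame, clears the second frame's reference bit, and evicts the third.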
- *Composite search key* if $k > 1$
- *unique key* if search key contains _candidate_ key of table
- Index is stored as a file
- *Clustered index*: Ordering of data records is the same as the ordering of data entries
- key is known as *clustering key*
- Format 1 index is clustered index (Assume format 2 and 3 to be unclustered)
=== Insertion
+ *Leaf node Overflow*
- Try to redistribute with a sibling first; split if redistribution is not possible
- *Split*: Create a new leaf $N$ with $d+1$ entries. Create a new index entry $(k, square.filled)$ where $k$ is smallest key in $N$
- *Redistribute*: If a sibling is not full, take entries from it. If entries are given to the right sibling, update the right sibling's parent entry; else update the current node's parent entry
+ *Internal node Overflow*
- Node has $2d+1$ keys.
- Push middle $(d+1)$-th key up to parent.
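The leaf-split rule above can be sketched for keys only (record pointers omitted); `split_leaf` is an illustrative helper, not from the notes.

```python
def split_leaf(entries, d):
    """Split an overflowing leaf of 2d+1 sorted entries: keep d in the old
    leaf, move d+1 into a new leaf N, and copy N's smallest key up as the
    key of the new index entry."""
    assert len(entries) == 2 * d + 1
    old, new = entries[:d], entries[d:]
    return old, new, new[0]

print(split_leaf([5, 10, 15, 20, 25], d=2))  # ([5, 10], [15, 20, 25], 15)
```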
- Collisions: Two keys hash to the same value.
- Need overflow pages if collisions exceed page capacity
#colbreak()
= Sorting
== Notation
#table(
columns: (auto, auto),
$|R|$, [number of pages in $R$],
$||R||$, [number of tuples in $R$],
$pi_L (R)$, [project columns in list $L$ from $R$ (duplicates removed)],
$pi_L^* (R)$, [project with duplicates retained],
)
== External Merge Sort
- *File size*: $N$ pages
- Memory pages available: $B$
- *Pass 0*: Create sorted runs
- Read and sort $B$ pages at a time
- *Pass i*: Use $B-1$ pages for input, 1 for output, performing a $(B-1)$-way merge
- *Analysis*
- Sorted runs: $N_0 = ceil(N/B)$
- Total passes: $ceil(log_(B-1) (N_0))+1$
- Total I/O: $2 N (ceil(log_(B-1) (N_0))+1)$
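The analysis above, as a small calculator; the inputs ($N = 108$ pages, $B = 5$ buffers) are hypothetical.

```python
from math import ceil

def ems_cost(N, B):
    """External merge sort: (sorted runs, total passes, total I/O in pages)."""
    n0 = ceil(N / B)                  # runs after pass 0
    runs, merge_passes = n0, 0
    while runs > 1:                   # each pass is a (B-1)-way merge
        runs = ceil(runs / (B - 1))
        merge_passes += 1
    passes = merge_passes + 1         # + pass 0
    return n0, passes, 2 * N * passes # every pass reads and writes N pages

print(ems_cost(108, 5))  # (22, 4, 864)
```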
=== Optimized Merge Sort
- Read and write in blocks of $b$ pages
- Allocate 1 Block for output
- Remaining memory for input: $floor(B/b)-1$ blocks
- *Analysis*
- sorted runs: $N_0 = ceil(N/B)$
- Runs merged at each pass (fan-in): $F = floor(B/b)-1$
- Number of merge passes: $ceil(log_F (N_0))$ ($+1$ for pass 0)
- Total IO: $2 N (ceil(log_F (N_0))+1)$
- *Sorting with B+ Trees*: IO cost: $h$ (root-to-leaf traversal) + scan of leaf pages + heap access (if not a covering index)
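The blocked-I/O analysis as a calculator; the inputs ($N = 1000$, $B = 30$, $b = 3$) are hypothetical.

```python
from math import ceil

def blocked_ems_cost(N, B, b):
    """Optimized merge sort with b-page blocks: fan-in F = floor(B/b) - 1."""
    n0 = ceil(N / B)                  # pass 0 still sorts B pages at a time
    F = B // b - 1                    # one block reserved for output
    runs, merge_passes = n0, 0
    while runs > 1:
        runs = ceil(runs / F)
        merge_passes += 1
    return 2 * N * (merge_passes + 1)

print(blocked_ems_cost(1000, 30, 3))  # 34 runs, 2 merge passes -> 6000
```

With $b = 1$ this reduces to the plain analysis above.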
== Projection
=== Sort based approach
- Extract attributes, Sort attributes, remove duplicates
- *Analysis*
+ Extract Attributes: $|R|"(scan)" + |pi_L^*(R)| "(output)"$
+ Sort Attributes:
- $N_0 = ceil((|pi_L^*(R)|)/B)$
- Merging Passes: $ceil(log_(B-1) (N_0))$
- Total IO: $2 |pi_L^*(R)| (ceil(log_(B-1) (N_0))+1)$
+ Remove Duplicates: $|pi_L^*(R)|$
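The three steps combined into one cost function; `T` stands for $|pi^*_L (R)|$ and the inputs ($|R| = 1000$ pages, $T = 250$, $B = 20$) are hypothetical.

```python
from math import ceil

def sort_projection_cost(R_pages, T, B):
    """Naive sort-based projection: extract + full sort of T pages + dedup scan."""
    extract = R_pages + T               # scan R, write projected pages
    runs, merge_passes = ceil(T / B), 0
    while runs > 1:
        runs = ceil(runs / (B - 1))
        merge_passes += 1
    sort = 2 * T * (merge_passes + 1)   # pass 0 + merge passes
    return extract + sort + T           # + duplicate-removal scan

print(sort_projection_cost(1000, 250, 20))  # 1250 + 1000 + 250 = 2500
```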
=== Optimized approach
- Split step 2 into creating and merging sorted runs; fold run creation into step 1 and duplicate removal (step 3) into the final merge
- *Analysis*
- *Step 1*
- $B-1$ pages for initial sorted run
- Sorted Runs: $N_0 = ceil((|pi^*_L (R)|) / (B-1))$
- Create sorted run = $|R| + |pi^*_L (R)|$
- *Step 2*
- Merging passes: $ceil(log_(B-1) (N_0))$
- Cost of merging: $2 |pi^*_L (R)| ceil(log_(B-1) (N_0))$
- Cost of merging excluding the final output IO: $(2 ceil(log_(B-1) (N_0))-1) |pi^*_L (R)| $
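The optimized projection cost, with hypothetical inputs $|R| = 1000$ pages, $|pi^*_L (R)| = 250$ (written `T`), $B = 20$.

```python
from math import ceil

def opt_sort_projection_cost(R_pages, T, B):
    """Optimized sort-based projection: project while creating runs, drop
    duplicates during the final merge (whose output is never written)."""
    step1 = R_pages + T                       # read R, write initial sorted runs
    runs, merge_passes = ceil(T / (B - 1)), 0
    while runs > 1:
        runs = ceil(runs / (B - 1))
        merge_passes += 1
    step2 = max(2 * merge_passes - 1, 0) * T  # final pass skips output I/O
    return step1 + step2

print(opt_sort_projection_cost(1000, 250, 20))  # 1250 + 250 = 1500
```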
=== Hash based approach
- *Partitioning*
- Allocate 1 page for input, $B-1$ pages for output.
- Read 1 page at a time, for each tuple, create projection, hash($h$) to distribute to $B-1$ buffers
- Flush to disk when full.
- *Duplicate Elimination*
- For each partition $R_i$, build an in-memory hash table: hash each tuple $t$ with a second hash function $h' != h$ to bucket $B_j$, inserting $t$ into $B_j$ only if $t in.not B_j$
- *Partition Overflow*: hash table for $pi^*_L (R_i)$ is larger than memory pages allocated for $pi_L (R)$
- *Analysis*
- IO Cost (no partition overflow) : $|R| + 2|pi^*_L (R)|$
- Partitioning Phase: $|R| + |pi^*_L (R)|$
- Duplicate Elimination: $|pi^*_L (R)|$
- To Avoid partition overflows:
- $|R_i| = (|pi^*_L (R)|) / (B-1)$
- $B > "size of hash table" = f times |R_i|$, where $f$ is a fudge factor
- $B > sqrt(f times |pi^*_L (R)|)$
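A toy in-memory version of the two phases; Python's built-in hashing stands in for $h$ (partitioning) and for $h'$ (the per-partition `set`), and `cols` selects the projection list. All names are illustrative.

```python
def hash_project(tuples, cols, n_partitions=4):
    """Phase 1: hash projected tuples into partitions with h.
    Phase 2: per partition, a set (second hash) removes duplicates."""
    parts = [[] for _ in range(n_partitions)]
    for t in tuples:
        proj = tuple(t[c] for c in cols)
        parts[hash(proj) % n_partitions].append(proj)
    out = []
    for part in parts:            # duplicates always share a partition
        out.extend(set(part))
    return sorted(out)

rows = [(1, "a", 9), (2, "a", 9), (3, "b", 7)]
print(hash_project(rows, cols=(1, 2)))  # [('a', 9), ('b', 7)]
```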
= Selection
- *Conjunct*: one or more terms connected by $or$
- *CNF predicate*: one or more conjuncts connected by $and$
- *Covered conjunct*: conjunct $p_i$ is covered by index $I$ if each attribute in $p_i$ appears in the key $K$ or the include columns of $I$
- $sigma_p (R), p = ("age" > 5) and ("height" = 180) and ("level" = 3), I_1 "key" = ("level", "weight", "height")$
- $p_c = ("height" = 180) and ("level" = 3)$
- *Primary Conjunct*: covered conjunct whose attributes can restrict the index scan (e.g. a prefix of the B+ tree key; here $("level" = 3)$)
- $sigma_p (R)$: Select rows from $R$ that satisfy predicate $p$
- Access Path: way of accessing data records / entries
- *Table Scan*: Scan all data pages (Cost: $|R|$)
- *Index Scan*: Scan index pages
- *Index Combination*: Combine from multiple index scans
- Scan/Combination can be followed by RID lookup to retrieve data
- *Index only plan*: plan that does not need to access any data tuples in $R$
- *Covering Index*: $I$ is a covering index if every attribute of $R$ used in the query is part of the key / include columns of $I$
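The covered-conjunct test is just a subset check; `is_covered` is an illustrative helper using the key from the example above.

```python
def is_covered(conjunct_attrs, key, include=()):
    """A conjunct is covered if all its attributes are key or include columns."""
    return set(conjunct_attrs) <= set(key) | set(include)

key = ("level", "weight", "height")
print(is_covered({"height"}, key))  # height = 180 -> covered
print(is_covered({"age"}, key))     # age > 5    -> not covered
```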
== B+ Trees
- For Index Scan + RID Lookup, many matching RIDs could refer to the same page
- Sort matching RIDs before performing lookup: Avoid retrieving same page
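Why sorting helps, as a toy count of page fetches; RIDs are modelled as `(page, slot)` pairs and the helper name is illustrative.

```python
def pages_fetched(rids):
    """Page fetches when RIDs are visited in sorted order: each page once."""
    fetched, last = 0, None
    for page, _slot in sorted(rids):
        if page != last:
            fetched += 1          # new page: one I/O
            last = page
    return fetched

rids = [(7, 1), (3, 0), (7, 2), (3, 5), (9, 0)]
print(pages_fetched(rids))  # 3 distinct pages instead of up to 5 lookups
```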