diff --git a/cs3223/cheatsheet.pdf b/cs3223/cheatsheet.pdf
index e1e3f3b..4ee8392 100644
Binary files a/cs3223/cheatsheet.pdf and b/cs3223/cheatsheet.pdf differ
diff --git a/cs3223/cheatsheet.typ b/cs3223/cheatsheet.typ
index 280eb14..40576ee 100644
--- a/cs3223/cheatsheet.typ
+++ b/cs3223/cheatsheet.typ
@@ -1,5 +1,5 @@
 #set page(paper: "a4", flipped: true, margin: 0.5cm, columns: 4)
-#set text(size: 9pt)
+#set text(size: 8pt)
 #show heading: set block(spacing:0.6em)
 
 = Storage
@@ -82,15 +82,15 @@
 - *Min nodes at level* i is $2 times (d + 1)^(i-1), i >= 1$
 - *Max nodes at level* i is $(2d + 1)^(i)$
-== Operations (Right sibling first, then left)
+=== Operations (Right sibling first, then left)
 === Insertion
 + *Leaf node Overflow* - Redistribute and then split
   - *Split* - Create a new leaf $N$ with $d+1$ entries. Create a new index entry $(k, square.filled)$ where $k$ is smallest key in $N$
   - *Redistribute* - If sibling is not full, take from it. If given right, update right's parent pointer, else current node's parent pointer
 + *Internal node Overflow*
-- Node has $2d+1$ keys.
-- Push middle $(d+1)$-th key up to parent.
+  - Node has $2d+1$ keys.
+  - Push middle $(d+1)$-th key up to parent.
 === Deletion
 + *Leaf node* - Redistribute then merge
@@ -111,7 +111,48 @@
 + For each leaf page, insert index entry to rightmost parent page
 
-== Hash based Index
-=== Static Hashing
-=== Linear Hashing
-=== Extensible Hashing
+= Hash based Index
+== Static Hashing
+- Data stored in $N$ buckets, where hash function $h(dot)$ is used to identify the bucket
+  - Record with key $k$ is inserted into $B_i, "where" i = h(k) mod N$
+- Bucket is a primary data page with 0+ overflow data pages
+== Linear Hashing
+- Grows linearly by splitting buckets
+  - Systematic splitting: bucket $B_i$ is split before $B_(i+1)$
+- Let $N_i = 2^i N_0$ be the file size at the beginning of round $i$
+- How to split bucket $B_i$
+  - Add bucket $B_j$ (split image of $B_i$)
+  - Redistribute entries in $B_i$ between $B_i$ and $B_j$
+  - `next++; if next == NLevel: (level++; next = 0)`
+=== Performance
+- Average: 1.2 I/Os for uniform data
+- Worst Case: Linear in number of entries
+
+== Extensible Hashing
+- Overflow is resolved by splitting the overflowed bucket
+- No overflow pages; order in which buckets are split is random
+- Directory of pointers to buckets; directory has $2^d$ entries
+  - $d$ is global depth of hashed file
+  - Each bucket maintains a local depth $l in [0, d]$
+  - Entries in a bucket of local depth $l$ share the same last $l$ bits
+=== Bucket Overflow
+- Number of directory entries can exceed number of buckets
+- Number of directory entries pointing to a bucket of local depth $l$ = $2^(d-l)$
+- When bucket $B$ with local depth $l$ overflows,
+  - Increment local depth of $B$ to $l+1$
+  - Allocate split image $B'$
+  - Redistribute entries between $B$ and $B'$ using the $(l+1)$-th bit
+- If $l+1 > "global depth " d$
+  - Directory is doubled in size; global depth becomes $d+1$
+  - Each new entry points to the same bucket as its corresponding old entry
+- If $l+1 <= "global depth " d$
+  - Update the directory entries for the split image to point to $B'$
+
+=== Bucket Deletion
+- $B_i$ & $B_j$ (same local depth $l$, differing only in the $l$-th bit) can be merged if their entries fit in one bucket
+  - $B_i$ is deallocated, $B_j$'s local depth is decremented by 1. Directory entries that pointed to $B_i$ now point to $B_j$
+=== Performance
+- At most 2 disk I/Os for equality selection
+- Collisions: keys with the same hash value
+  - Need overflow pages if collisions exceed page capacity
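
The linear-hashing rule added above (`next++; if next == NLevel: (level++; next = 0)`) can be made concrete with a small sketch. The Python below is illustrative only and not part of the cheatsheet or the diff; `N0`, `PAGE_SIZE`, `LinearHashFile`, and the split trigger are assumptions (a real system splits on page overflow and keeps overflow chains on disk).

```python
# Sketch of linear hashing: lookup uses h_level, or h_{level+1} if the
# target bucket was already split this round; splits are round-robin.

N0 = 4          # buckets at the start of round 0 (assumed)
PAGE_SIZE = 4   # entries per bucket before a split is triggered (assumed)

class LinearHashFile:
    def __init__(self):
        self.level = 0                           # current round
        self.next = 0                            # next bucket to split
        self.buckets = [[] for _ in range(N0)]

    def _addr(self, key):
        n_level = (2 ** self.level) * N0         # N_level = 2^level * N_0
        i = hash(key) % n_level                  # h_level(key)
        if i < self.next:                        # bucket already split:
            i = hash(key) % (2 * n_level)        # use h_{level+1}(key)
        return i

    def insert(self, key):
        bucket = self.buckets[self._addr(key)]
        bucket.append(key)
        if len(bucket) > PAGE_SIZE:              # overflow triggers a split
            self._split()

    def _split(self):
        n_level = (2 ** self.level) * N0
        # add the split image of bucket `next` (it lands at index
        # next + n_level), then redistribute its entries with h_{level+1};
        # note the split bucket need not be the one that overflowed
        old_entries = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])                  # split image B_j
        for key in old_entries:
            self.buckets[hash(key) % (2 * n_level)].append(key)
        # next++; if next == N_level: (level++; next = 0)
        self.next += 1
        if self.next == n_level:
            self.level += 1
            self.level, self.next = self.level, 0
            self.next = 0
```

Inserting a stream of integer keys grows the file one bucket at a time as `next` sweeps the round, matching the "split $B_i$ before $B_(i+1)$" rule.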
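
Likewise, the extensible-hashing overflow path (double the directory when $l+1 > d$, otherwise only redirect entries to the split image) can be sketched in a few lines. This is a minimal in-memory illustration, not the cheatsheet's method; `CAPACITY`, `Bucket`, and `ExtensibleHashFile` are invented names, and Python's `hash()` stands in for $h(dot)$.

```python
# Sketch of extensible hashing: directory entry i points to the bucket
# whose entries share the last global_depth bits of i.

CAPACITY = 4    # entries per bucket/page (assumed)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtensibleHashFile:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # 2^d directory entries

    def _dir_index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)  # last d bits

    def insert(self, key):
        bucket = self.directory[self._dir_index(key)]
        bucket.keys.append(key)
        if len(bucket.keys) > CAPACITY:
            self._split(bucket)

    def _split(self, bucket):
        bucket.local_depth += 1                  # l -> l + 1
        image = Bucket(bucket.local_depth)       # allocate split image B'
        if bucket.local_depth > self.global_depth:
            # double the directory: each new entry points to the same
            # bucket as its corresponding old entry
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # directory entries whose (l+1)-th bit is 1 now point to B'
        bit = bucket.local_depth - 1
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> bit) & 1:
                self.directory[i] = image
        # redistribute entries between B and B' on the (l+1)-th bit
        old_keys, bucket.keys = bucket.keys, []
        for key in old_keys:
            self.directory[self._dir_index(key)].keys.append(key)
        # rare case: every entry shares the new bit, so one side still overflows
        for b in (bucket, image):
            if len(b.keys) > CAPACITY:
                self._split(b)
```

An equality lookup is `self.directory[self._dir_index(key)]` followed by a scan of that bucket, which is the "at most 2 disk I/Os" the cheatsheet quotes (one for the directory if it is not cached, one for the bucket).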