#set page(paper: "a4", flipped: true, margin: 0.5cm, columns: 4)
#set text(size: 8pt)
#show heading: set block(spacing:0.6em)

= Storage
- Parts of disk
  - Platter has 2 surfaces
  - Surface has many tracks
  - Each track is broken up into sectors
  - Cylinder is the same track across all surfaces
  - Block comprises multiple sectors
- *Disk Access Time*: $"Seek time" + "Rotational Delay" + "Transfer Time"$
  - *Seek Time*: Move arms to position disk head
  - *Rotational Delay*: $1/2 times 60/"RPM"$ seconds (on average, half a revolution)
  - *Transfer time* (for $n$ sectors): $n times "time for 1 revolution"/ "sectors per track"$
    - $n$ is the number of requested sectors on the track

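As a rough illustration of the access-time formula above, the Python sketch below plugs in numbers for a hypothetical disk. The function names and the example figures (5 ms seek, 7200 RPM, 400 sectors per track, 8 sectors requested) are assumptions made for illustration only.

```python
def rotational_delay_ms(rpm):
    """Average rotational delay: half of one revolution, in ms."""
    return 0.5 * 60_000 / rpm

def transfer_time_ms(n, rpm, sectors_per_track):
    """Time to transfer n sectors of one track, in ms."""
    return n * (60_000 / rpm) / sectors_per_track

def disk_access_time_ms(seek_ms, n, rpm, sectors_per_track):
    """Seek time + rotational delay + transfer time, in ms."""
    return (seek_ms
            + rotational_delay_ms(rpm)
            + transfer_time_ms(n, rpm, sectors_per_track))

# Hypothetical disk: 5 ms seek, 7200 RPM, 400 sectors per track.
print(round(disk_access_time_ms(5.0, 8, 7200, 400), 3))  # 9.333
```

With these numbers the rotational delay (about 4.17 ms) dominates the transfer time (about 0.17 ms), which is why contiguous access is preferred.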
- Access Order
  + Contiguous blocks within same track (same surface)
  + Tracks within same cylinder
  + Next cylinder

== Buffer Manager
#image("buffer-manager.png")
- Buffer pool is divided into block-sized pages called frames
- Each frame maintains a pin count (PC) and a dirty flag
=== Replacement Policies
- Decide which unpinned page to replace
- *LRU*: queue of pointers to frames with PC = 0
- *Clock*: LRU variant
  - *Reference bit*: turned on when PC drops to 0
  - Replace a page when its reference bit is off and PC = 0; if the bit is on, the clock hand turns it off and moves on
#image("clock-replacement-policy.png")

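The clock policy above can be sketched in Python as follows. This is a minimal in-memory model, not a real buffer manager: the `Frame` class, `unpin` and `clock_victim` are hypothetical names, and dirty-page write-back is left out.

```python
class Frame:
    def __init__(self, page):
        self.page = page
        self.pin_count = 0
        self.ref_bit = False

def unpin(frame):
    """Release one pin; the reference bit turns on when PC reaches 0."""
    frame.pin_count -= 1
    if frame.pin_count == 0:
        frame.ref_bit = True

def clock_victim(frames, hand):
    """Return (victim_index, new_hand), or (None, hand) if all are pinned."""
    n = len(frames)
    # Two sweeps suffice: the first may clear bits, the second finds a victim.
    for _ in range(2 * n):
        f = frames[hand]
        if f.pin_count == 0:
            if not f.ref_bit:
                return hand, (hand + 1) % n
            f.ref_bit = False  # second chance: clear the bit and move on
        hand = (hand + 1) % n
    return None, hand
```

A pinned frame is always skipped, so a page in use is never chosen as the victim regardless of its reference bit.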
== Files
- Heap File Implementation
  - Linked List
    - 2 linked lists, 1 of free pages, 1 of data pages
  - Page Directory Implementation
    - Directory structure, 1 entry per page
    - To insert, scan directory to find a page with space to store the record
*Page Formats*
- *RID* = (page id, slot number)
- Fixed Length Records
  - Packed organization: Store records in contiguous slots (deletion requires moving the last record into the freed slot)
  - Unpacked organization: Use bit array to track free slots
- *Variable Length Records*: Slotted page organization

*Record Formats*
- Fixed Length Records: Fields stored consecutively
- Variable Length Records
  - Delimit fields with special symbols (F1 \$ F2 \$ F3)
  - Array of field offsets ($o_1, o_2, o_3, F_1, F_2, F_3$)
*Data Entry Formats*
1. $k*$ is an actual data record (with search key value $k$)
2. $k*$ is of the form *(k, rid)*
3. $k*$ is of the form *(k, rid-list)*, where rid-list is the list of RIDs of data records with key $k$

= B+ Tree index
- *Search key* is a sequence of $k$ data attributes, $k >= 1$
  - *Composite search key* if $k > 1$
- *Unique index* if search key contains a _candidate_ key of the table
- Index is stored as a file
- *Clustered index*: Order of data records is the same as (or close to) the order of data entries
  - Key is known as the *clustering key*
  - Format 1 index is a clustered index (assume formats 2 and 3 are unclustered)

== Tree based Index
- *Root node* at level 0
- Height of tree = no. of levels of internal nodes
- *Leaf nodes*
  - At level $h$, where $h$ is height of tree
- *Internal nodes* store entries in the form $(p_0, k_1, p_1, k_2, p_2, ..., k_n, p_n)$
  - $k_1 < k_2 < ... < k_n$
  - $p_i$ = disk page address
- *Order* $d$ of index tree
  - Each non-root node has $m in [d, 2d]$ entries
  - Root node has $m in [1, 2d]$ entries
- *Equality search* for key $k$: At each _internal_ node $N$, find largest key $k_i$ in $N$ such that $k_i <= k$
  - If $k_i$ exists, go to subtree $p_i$, else $p_0$
- *Range search*: Find first matching record, then traverse the doubly linked list of leaves
- *Min nodes at level* $i$ is $2 times (d + 1)^(i-1), i >= 1$
- *Max nodes at level* $i$ is $(2d + 1)^(i)$

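To make the search rule and the node-count bounds above concrete, here is a small Python sketch. It assumes an internal node is represented only by its sorted key list `keys` (holding $k_1$ to $k_n$), and the function names are made up for illustration.

```python
from bisect import bisect_right

def child_index(keys, k):
    """Index i of the pointer p_i to follow for search key k.

    bisect_right counts the keys <= k, which is the position of the
    largest k_i <= k (or 0 when no such key exists, i.e. follow p_0).
    """
    return bisect_right(keys, k)

def min_nodes(d, i):
    """Minimum number of nodes at level i of an order-d tree."""
    return 1 if i == 0 else 2 * (d + 1) ** (i - 1)

def max_nodes(d, i):
    """Maximum number of nodes at level i of an order-d tree."""
    return (2 * d + 1) ** i

print(child_index([10, 20, 30], 25))      # 2, follow p_2
print(min_nodes(2, 2), max_nodes(2, 2))   # 6 25
```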
=== Operations (try right sibling first, then left)
=== Insertion
+ *Leaf node overflow*
  - Try redistribution first, then split
  - *Split*: Create a new leaf $N$ with $d+1$ entries. Create a new index entry $(k, square.filled)$ where $k$ is the smallest key in $N$
  - *Redistribute*: If a sibling is not full, move an entry to it. If the entry went to the right sibling, update the right sibling's key in the parent, else update the current node's key in the parent
+ *Internal node overflow*
  - Node has $2d+1$ keys
  - Push middle $(d+1)$-th key up to parent
=== Deletion
+ *Leaf node underflow*
  - Try redistribution first, then merge
  - *Redistribution*
    - Sibling must have $> d$ records to borrow from
    - Update the parent key to the smallest key of the right node of the pair
  - *Merge*
    - If sibling has exactly $d$ entries, then merge
    - Combine with sibling, then remove the corresponding entry from the parent node
+ *Internal node underflow*
  - Let $N'$ be an adjacent _sibling_ node of $N$ with $l$ entries, $l > d$
  - Insert $(K, N' . p_i)$ into $N$, where $K$ is the parent key separating $N$ and $N'$ and $i$ is the leftmost (0) or rightmost ($l$) entry
  - Replace $K$ in parent node with $N'.k_i$
  - Remove $(p_i, k_i)$ entry from $N'$
=== Bulk Loading
+ Sort entries by search key
+ Load leaf pages with $2d$ entries each
+ For each leaf page, insert an index entry into the rightmost parent page

= Hash based Index
== Static Hashing
- Data stored in $N$ buckets, where hash function $h(dot)$ is used to identify the bucket
- Record with key $k$ is inserted into $B_i, "where" i = h(k) mod N$
- Bucket is a primary data page with 0 or more overflow data pages
== Linear Hashing
- Grows linearly by splitting buckets
- Systematic splitting: Bucket $B_i$ is split before $B_(i+1)$
- Let $N_i = 2^i N_0$ be the file size (number of buckets) at the beginning of round $i$
- How to split bucket $B_i$
  - Add bucket $B_j$ (split image of $B_i$)
  - Redistribute entries in $B_i$ between $B_i$ and $B_j$
  - `next++; if next == NLevel: (level++; next = 0)`
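The split steps above can be sketched in Python as follows. This is a toy model under several assumptions: buckets are plain lists with no page capacity or overflow pages, `split` is called explicitly instead of being triggered by an overflow, keys are non-negative integers hashed with the built-in `hash`, and the addressing rule in `address` (use the next round's modulus for buckets below `next`) is the usual linear hashing rule, which these notes do not spell out. The class name is hypothetical.

```python
class LinearHashFile:
    def __init__(self, n0=4):
        self.n0, self.level, self.next = n0, 0, 0
        self.buckets = [[] for _ in range(n0)]

    def _n_level(self):
        # Number of buckets at the start of the current round.
        return self.n0 * 2 ** self.level

    def address(self, key):
        i = hash(key) % self._n_level()
        if i < self.next:  # bucket i was already split in this round
            i = hash(key) % (2 * self._n_level())
        return i

    def insert(self, key):
        self.buckets[self.address(key)].append(key)

    def split(self):
        i, n = self.next, self._n_level()
        old, self.buckets[i] = self.buckets[i], []
        self.buckets.append([])  # split image B_j with j = i + n
        for key in old:          # redistribute between B_i and B_j
            self.buckets[hash(key) % (2 * n)].append(key)
        self.next += 1
        if self.next == n:       # round finished
            self.level, self.next = self.level + 1, 0
```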
=== Performance
- Average: 1.2 IO for uniform data
- Worst case: Linear in number of entries

== Extensible Hashing
- Overflow is resolved by splitting the overflowed bucket
- No overflow pages, and the order in which buckets are split is random
- Directory of pointers to buckets, directory has $2^d$ entries
  - $d$ is the global depth of the hashed file
- Each bucket maintains a local depth $l in [0, d]$
  - Entries in a bucket of local depth $l$ have the same last $l$ bits
=== Bucket Overflow
- Number of directory entries could be more than number of buckets
- Number of dir entries pointing to a bucket = $2^(d-l)$
- When bucket $B$ with local depth $l$ overflows,
  - Increment local depth of $B$ to $l+1$
  - Allocate split image $B'$
  - Redistribute entries between $B$ and $B'$ using the $(l+1)$th bit
  - If $l+1 > "global depth " d$
    - Directory is doubled in size, global depth becomes $d+1$
    - New entries point to the same bucket as their corresponding entry
  - If $l+1 <= "global depth " d$
    - Update the directory entries corresponding to the split image to point to $B'$

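The relationship between global depth, local depth and directory entries can be checked with a short Python sketch. It assumes the directory is indexed by the last $d$ bits of the hash value, written here as an integer, and that a bucket is identified by the `suffix` shared by its entries. Both function names are hypothetical.

```python
def directory_entries(d, l, suffix):
    """Directory indices (d-bit values) whose last l bits equal suffix."""
    return [i for i in range(2 ** d) if i % (2 ** l) == suffix]

def split_action(d, l):
    """What a split of a bucket with local depth l requires."""
    return "double directory" if l + 1 > d else "repoint entries"

entries = directory_entries(3, 1, 1)
print(entries)                 # [1, 3, 5, 7]
print(len(entries) == 2 ** (3 - 1))  # True
```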
=== Bucket Deletion
- $B_i$ and $B_j$ (with the same local depth $l$, differing only in the $l$th bit) can be merged if their entries fit in one bucket
- $B_i$ is deallocated, $B_j$'s local depth is decremented by 1. Directory entries that pointed to $B_i$ now point to $B_j$
=== Performance
- At most 2 disk IOs for equality selection
- Collision: two entries have the same hashed value
  - Need overflow pages if collisions exceed page capacity

#colbreak()
= Sorting
== Notation
#table(
  columns: (auto, auto),
  $|R|$, [pages in $R$],
  $||R||$, [tuples in $R$],
  $pi_L (R)$, [project columns in list $L$ from $R$ (no duplicates)],
  $pi_L^* (R)$, [project with duplicates kept],

  $b_d$, [Data records that can fit on a page],
  $b_i$, [Data entries that can fit on a page],
  $b_r$, [RIDs that can fit on a page],
)
== External Merge Sort
- *File size*: $N$ pages
- Memory pages available: $B$
- *Pass 0*: Create sorted runs
  - Read and sort $B$ pages at a time
- *Pass i*: Use $B-1$ pages for input, 1 for output, performing a $(B-1)$-way merge
- *Analysis*
  - Sorted runs: $N_0 = ceil(N/B)$
  - Total passes: $ceil(log_(B-1) (N_0))+1$
  - Total I/O: $2 N (ceil(log_(B-1) (N_0))+1)$
=== Optimized Merge Sort
- Read and write in blocks of $b$ pages
- Allocate 1 block for output
- Remaining memory for input: $floor(B/b)-1$ blocks
- *Analysis*
  - Sorted runs: $N_0 = ceil(N/B)$
  - Runs merged at each pass: $F = floor(B/b)-1$
  - No. of merge passes: $ceil(log_F (N_0))$ (+1 for total passes)
  - Total IO: $2 N (ceil(log_F (N_0))+1)$
- *Sorting with B+ Trees*: IO cost: $h$ + scan of leaf pages + heap access (if not a covering index)

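The pass and I/O formulas for both the plain and the blocked merge sort can be evaluated with the Python sketch below. The function names and the example sizes (1000 pages, 11 buffer pages) are assumptions for illustration. The logarithm is computed with integer arithmetic to avoid floating-point rounding at exact powers.

```python
from math import ceil

def merge_passes(runs, fan_in):
    """Smallest p such that fan_in ** p >= runs."""
    p, capacity = 0, 1
    while capacity < runs:
        capacity *= fan_in
        p += 1
    return p

def external_sort_io(n_pages, b_pages, block=1):
    """Return (total passes, total page I/Os); block=1 is the plain variant."""
    n0 = ceil(n_pages / b_pages)      # sorted runs after pass 0
    fan_in = b_pages // block - 1     # B - 1 when block == 1
    if fan_in < 2:
        raise ValueError("need a merge fan-in of at least 2")
    passes = merge_passes(n0, fan_in) + 1
    return passes, 2 * n_pages * passes

print(external_sort_io(1000, 11))           # (3, 6000)
print(external_sort_io(1000, 11, block=2))  # (5, 10000)
```

The blocked variant needs more passes in this example because its fan-in drops from 10 to 4. The formulas count page I/Os only, so the benefit of blocked I/O (fewer seeks per page transferred) does not show up in these numbers.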
== Projection
=== Sort based approach
- Extract attributes, sort on them, remove duplicates
- *Analysis*
  + Extract attributes: $|R|"(scan)" + |pi_L^*(R)| "(output)"$
  + Sort attributes:
    - $N_0 = ceil((|pi_L^*(R)|)/B)$
    - Merging passes: $ceil(log_(B-1) (N_0))$
    - Total IO: $2 |pi_L^*(R)| (ceil(log_(B-1) (N_0))+1)$
  + Remove duplicates: $|pi_L^*(R)|$
=== Optimized approach
- Split step 2 into creating and merging sorted runs, then fold run creation into step 1 and merging into step 3
- *Analysis*
  - *Step 1*
    - $B-1$ pages for initial sorted runs
    - Sorted runs: $N_0 = ceil((|pi^*_L (R)|) / (B-1))$
    - Cost of creating sorted runs: $|R| + |pi^*_L (R)|$
  - *Step 2*
    - Merging passes: $ceil(log_(B-1) (N_0))$
    - Cost of merging: $2 |pi^*_L (R)| ceil(log_(B-1) (N_0))$
    - Cost of merging excluding final output IO: $(2 ceil(log_(B-1) (N_0))-1) |pi^*_L (R)|$

=== Hash based approach
- *Partitioning*
  - Allocate 1 page for input, $B-1$ pages for output
  - Read 1 page at a time; for each tuple, create its projection and hash it with $h$ to one of the $B-1$ output buffers
  - Flush an output buffer to disk when full
- *Duplicate Elimination*
  - For each partition $R_i$, build a hash table: hash each tuple $t$ with hash function $h' != h$ to bucket $B_j$, and insert $t$ if $t in.not B_j$
- *Partition Overflow*: hash table for $pi^*_L (R_i)$ is larger than the memory pages allocated to it

- *Analysis*
  - IO Cost (no partition overflow): $|R| + 2|pi^*_L (R)|$
    - Partitioning phase: $|R| + |pi^*_L (R)|$
    - Duplicate elimination: $|pi^*_L (R)|$
  - To avoid partition overflows:
    - $|R_i| = (|pi^*_L (R)|) / (B-1)$ (assuming uniform partitioning)
    - $B > "size of hash table" = |R_i| times f$
    - Approximately, $B > sqrt(f times |pi^*_L (R)|)$

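The two phases of the hash based approach can be sketched in memory as below. This is only an illustration of the control flow: rows are assumed to be Python dicts, partitions are lists instead of disk pages, Python's built-in `hash` plays the role of $h$, and a `set` stands in for the per-partition hash table built with $h'$. The function name is hypothetical.

```python
def hash_project(rows, attrs, n_partitions):
    """Project rows onto attrs and remove duplicates, partition by partition."""
    # Partitioning phase: project each tuple, then hash it to a partition.
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        t = tuple(row[a] for a in attrs)
        partitions[hash(t) % n_partitions].append(t)

    # Duplicate elimination phase: duplicates always land in the same
    # partition, so each partition can be handled independently.
    out = []
    for part in partitions:
        seen = set()
        for t in part:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out

rows = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 2, "b": 2}]
print(sorted(hash_project(rows, ["a"], 3)))  # [(1,), (2,)]
```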
= Selection
- *Conjunct*: $>= 1$ terms connected by $or$
- *CNF predicate*: $>= 1$ conjuncts connected by $and$
- *Covered Conjunct*: conjunct $p_i$ is covered if each attribute in $p_i$ is in the key $K$ or an include column of index $I$
  - $p = ("age" > 5) and ("height" = 180) and ("level" = 3)$
  - $I_1 "key" = ("level", "weight", "height")$
  - $p_c "wrt" I_1 = ("height" = 180) and ("level" = 3)$
- *Primary Conjunct*
  - $I$ matches $p$ if attributes in $p$ form a prefix of $K$ and all comparison operators are equality except possibly the last
  - $p_p$ is the largest subset of conjuncts in $p$ such that $I$ matches $p_p$
- $sigma_p (R)$: Select rows from $R$ that satisfy predicate $p$
- Access Path: way of accessing data records / entries
  - *Table Scan*: Scan all data pages (Cost: $|R|$)
  - *Index Scan*: Scan index pages
  - *Index Combination*: Combine results from multiple index scans
  - Scan/Combination can be followed by RID lookup to retrieve data
- *Index only plan*: Query plan that does not need to access any data tuples in $R$
- *Covering Index*: $I$ is a covering index if all of $R$'s attributes referenced in the query are part of the key / include columns of $I$

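The covered and primary conjunct definitions above can be tried out on the running example with a small Python sketch. It makes simplifying assumptions: each conjunct is a single term written as an `(attribute, operator, value)` tuple, there is at most one conjunct per attribute, and the index has no include columns. The function names are hypothetical.

```python
def covered_conjuncts(conjuncts, key):
    """Conjuncts whose attribute appears in the index key."""
    return [c for c in conjuncts if c[0] in key]

def primary_conjuncts(conjuncts, key):
    """Largest set of conjuncts matched by the index key.

    Walk the key attributes in order; stop at the first attribute with no
    conjunct, or right after the first non-equality conjunct.
    """
    by_attr = {c[0]: c for c in conjuncts}
    matched = []
    for attr in key:
        c = by_attr.get(attr)
        if c is None:
            break
        matched.append(c)
        if c[1] != "=":
            break
    return matched

p = [("age", ">", 5), ("height", "=", 180), ("level", "=", 3)]
key = ("level", "weight", "height")
print(covered_conjuncts(p, key))  # [('height', '=', 180), ('level', '=', 3)]
print(primary_conjuncts(p, key))  # [('level', '=', 3)]
```

Under these assumptions only the conjunct on `level` is primary, because `weight` has no conjunct and so breaks the key prefix before `height` is reached.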
== B+ Trees
- For Index Scan + RID Lookup, many matching RIDs could refer to the same page
  - Sort matching RIDs before performing lookup to avoid retrieving the same page more than once

=== Analysis
#{
let nin = [$N_"internal"$]
let nle = [$N_"leaf"$]
let nlo = [$N_"lookup"$]
let nso = [$N_"sort"$]
let nco = [$N_"combine"$]
[
Cost of index scan = $nin + #nle + nlo$
- *#nin*: No. of internal nodes accessed
  - Height of B+ tree index
  - $ "height(est)" = cases(
      ceil(log_F (ceil( (||R||) / b_d))) &"if index is clustered",
      ceil(log_F (ceil( (||R||) / b_i))) &"otherwise",
    ) $
- *#nlo*: Data pages accessed for RID lookups
  - If $I$ is a covering index for $sigma_p (R)$, $nlo = 0$
  - else $nlo = ||sigma_p_c (R)||$
  - If matching RIDs are sorted before RID lookup
    - $nlo = nso + min{||sigma_p_c (R)||, |R|}$
- *#nso*: Cost of sorting matching RIDs
  - $nso = 0 "if" ceil( (||sigma_p_c (R)||) / b_r ) <= B$ (if RIDs can fit into $B$ pages)
  - $ nso = 2 ceil((||sigma_p_c (R)||) / b_r) ceil(log_(B-1) (N_0)), N_0 = ceil(ceil((||sigma_p_c (R)||) / b_r) / B) $
    - Sorting with External Merge Sort
    - #nso doesn't include read IO for pass 0 as it's included in #nin and #nle
    - #nso doesn't include write IO for the final merging pass as the RIDs are used directly for lookup
- *#nle*: Leaf pages scanned for evaluating $sigma_p (R)$
  - $nle = ceil((||sigma_p_p (R)||)/b_d)$ if clustered
  - $nle = ceil((||sigma_p_p (R)||)/b_i)$ if unclustered
- *Index Combination*
  - Cost = $nin^p + nle^p + nin^q + nle^q + nco + nlo$
  - #nco: IO cost to compute join of $pi_p$ and $pi_q$
    - If $min{|pi_X_p (S_p)|, |pi_X_q (S_q)|} <= B$
    - One of the join operands can fit in memory, then $nco = 0$
== Hash based Index Scan
- Cost: $N_"dir" + N_"bucket" + nlo$
  - $N_"dir"$: no. of directory pages accessed (1 if extensible hash, 0 otherwise)
  - $N_"bucket"$: max no. of index's primary/overflow pages accessed
  - $nlo= nso + min{||sigma_p_c (R)||, |R|}$ if $I$ is not a covering index for $sigma_p (R)$
]}

= Join Algorithms
Considerations when choosing a join algorithm
- Types of join predicates (equality / inequality)
- Sizes of join operands
- Allocated memory pages
- Available access methods
- *Notation*: $R join_(A) S$
  - $R$ is outer relation, $S$ is inner relation
- Nested Loop Join (NLJ) and partition based join
== Tuple-based NLJ
For each page of $R$ and each tuple $r$ in that page, scan every page of $S$ and check each tuple $s$ in $S$ against $r$
- *Cost*: $|R| + ||R|| times |S|$ (read $S$ once for each tuple in $R$)
- *Optimised*: Page based. For each page of $R$, scan every page of $S$, then compare all tuple pairs of the two pages
  - *Cost*: $|R| + |R| times |S|$ (read $S$ once for each page in $R$)

== Main Memory NLJ
Assuming $|S| < |R|$, for optimal IO, compute $R join S$ with the smaller operand $S$ as inner relation, held entirely in memory
- Min pages needed: $B = |S| + 2$
  - 1 for $B_"outer"$, 1 for $B_"join"$, rest for $B_"inner"$ to hold $S$
- *Cost*: $|R| + |S|$

== Block NLJ
- $R join S = union.big^k_(i=1) (R_i join S)$, $k = ceil((|R|)/B_"outer")$
- To minimise IO cost, minimise $|R|$ or maximise $B_"outer"$
  - Choose smaller table as outer ($R "if" |R| < |S|$)
- IO Cost: $|R| + ceil((|R|)/(B-2)) times |S|$

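To compare the nested loop variants covered so far, the Python sketch below evaluates their I/O cost formulas on one hypothetical workload. The relation sizes (100 pages and 1000 tuples for the outer relation, 500 pages for the inner relation, 12 buffer pages) and the function names are assumptions for illustration.

```python
from math import ceil

def tuple_nlj_cost(pages_r, tuples_r, pages_s):
    """Scan R once, and scan S once per tuple of R."""
    return pages_r + tuples_r * pages_s

def page_nlj_cost(pages_r, pages_s):
    """Scan R once, and scan S once per page of R."""
    return pages_r + pages_r * pages_s

def block_nlj_cost(pages_r, pages_s, b):
    """Scan R once, and scan S once per block of B - 2 pages of R."""
    return pages_r + ceil(pages_r / (b - 2)) * pages_s

print(tuple_nlj_cost(100, 1000, 500))  # 500100
print(page_nlj_cost(100, 500))         # 50100
print(block_nlj_cost(100, 500, 12))    # 5100
```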
== Index NLJ
- Inner relation $S$: table with an index on the join attribute
- Cost: $|R| + ||R|| times (N_"internal" + N_"leaf" + N_"lookup")$
  - Scan $R$ + search $S$'s index once for each tuple in $R$

== Sort-Merge Join
- $R join S = union.big_(i in J) (R_i times S_i), "where" J = {i | R_i != emptyset, S_i != emptyset}$
  - $X_i subset.eq X$ is the partition of $X$ where all records have join attribute value $i$
- *Cost*: Sort $R$ + Sort $S$ + Merging cost
  - Sorting cost: $0$ if sorted, or internal sorting
  - Min merging cost: maximise $|B_"inner"|$
    - $S$ to be inner relation if $|"Max"P_S|<= |"Max"P_R|$
    - $"Max"P_X$: largest matching $X$-partition
    - If $|"Max"P_S|<= B-2$, Cost: $|R| + |S|$
    - else $|S| + ceil((|S|)/(B-2)) |R|$
== Optimized SMJ
- $S$ to be inner relation if $|"Max"P_S|<= |"Max"P_R|$
- Find $i$ and $j$ such that $B > N(R, i) + N(S, j)$
  - $N(X, 0) = ceil((|X|) / B)$ and $N(X, k) = ceil((N(X, k-1))/(B-1))$
- *Cost*: $2|R|(i+1) + 2|S|(j+1) + |R| + |S|$
  - Partial sort $R$ + partial sort $S$ + merge and join

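One way to find $i$ and $j$ for the optimized sort-merge join is sketched below in Python. The greedy choice (give the next merge pass to whichever relation currently has more runs) is an assumption of this sketch, not a rule stated in these notes, and the function names and example sizes are hypothetical.

```python
from math import ceil

def runs_after(pages, b, k):
    """N(X, k): sorted runs left after pass 0 and k merge passes."""
    n = ceil(pages / b)
    for _ in range(k):
        n = ceil(n / (b - 1))
    return n

def optimized_smj_cost(pages_r, pages_s, b):
    """Return (i, j, cost) with B > N(R, i) + N(S, j)."""
    if b < 3:
        raise ValueError("need at least 3 buffer pages")
    i = j = 0
    while runs_after(pages_r, b, i) + runs_after(pages_s, b, j) >= b:
        # Greedy heuristic: merge the relation with more runs once more.
        if runs_after(pages_r, b, i) >= runs_after(pages_s, b, j):
            i += 1
        else:
            j += 1
    cost = 2 * pages_r * (i + 1) + 2 * pages_s * (j + 1) + pages_r + pages_s
    return i, j, cost

print(optimized_smj_cost(1000, 500, 11))  # (2, 1, 9500)
```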
== Grace Hash Join
- Partition $R$: $R_1, ..., R_k$, Partition $S$: $S_1, ..., S_k$
- Read $R_i$ to build hash table (build relation)
- Read $S_i$ to probe hash table (probe relation)
- $R_i$ overflows if its hash table is larger than the memory pages allocated
  - Recursively partition $R_i$ and $S_i$
- *To avoid overflow*
  - Pick smaller operand $R$ as build relation $(|R| <= |S|)$
  - Partitioning: Maximise number of build partitions to minimise partition size
    - 1 page to read build relation $R$
    - 1 output page for each of the $B-1$ partitions
  - Probing: Maximise memory allocated for hash table
    - 1 page to read probe partition $S_i$
    - 1 page to output $S_i join R_i$
    - $B-2$ pages for $R_i$'s hash table
  - $B > sqrt(f times |R|)$: memory size to avoid overflow
- *Cost*: $2(|R| + |S|) + (|R| + |S|) = 3(|R| + |S|)$
  - 2 for partitioning, 1 for probing
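The partition, build and probe steps of the Grace hash join can be sketched in memory as follows. This is an illustration only: tuples are Python tuples, `key_r` and `key_s` are the positions of the join attribute, partitions are lists instead of disk pages, and partition overflow is not handled. All names are hypothetical.

```python
def grace_hash_join(r, s, key_r, key_s, k):
    """Equi-join r and s on r[key_r] == s[key_s] using k partitions."""
    # Partitioning phase: the same hash function is used for both inputs,
    # so matching tuples end up in partitions with the same index.
    parts_r = [[] for _ in range(k)]
    parts_s = [[] for _ in range(k)]
    for t in r:
        parts_r[hash(t[key_r]) % k].append(t)
    for u in s:
        parts_s[hash(u[key_s]) % k].append(u)

    out = []
    for r_i, s_i in zip(parts_r, parts_s):
        table = {}                      # build phase on R_i
        for t in r_i:
            table.setdefault(t[key_r], []).append(t)
        for u in s_i:                   # probe phase with S_i
            for t in table.get(u[key_s], []):
                out.append((t, u))
    return out

def grace_hash_join_io(pages_r, pages_s):
    """Read and write both inputs once, then read both partitions once."""
    return 3 * (pages_r + pages_s)

r = [(1, "a"), (2, "b")]
s = [(2, "x"), (3, "y"), (2, "z")]
print(sorted(grace_hash_join(r, s, 0, 0, 4)))
print(grace_hash_join_io(100, 500))  # 1800
```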