Notes: Storage and Retrieval
source: Designing Data Intensive Applications by Martin Kleppmann (Chapter 3)
A database has two jobs:
- store data
- retrieve data
It’s important to pick the most appropriate storage engine for your application. The main types optimize for one of two things:
- transactional workloads
- analytics
Database Data Structures
Log File: an append-only sequence of records. This is the simplest form of a database. Appending takes O(1), but a read has to scan the whole log, which takes O(n), so we need something better for reads.
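A minimal sketch of this log-as-database idea in Python (the file name and comma-separated record format are mine, not from the book):

```python
import os

DB_FILE = "database.log"  # hypothetical file name

def db_set(key, value):
    # O(1): just append a "key,value" record to the end of the log
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # O(n): scan the whole log and keep the last value seen for the key
    if not os.path.exists(DB_FILE):
        return None
    result = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result
```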
Solution? The index, a data structure that helps you find a key in the database. An index is derived from the primary data. Maintaining it makes writes a little slower, but it’s almost always worth it for faster reads.
Hash Indexes
A hash index is an in-memory hash map from each key to a byte offset in the data file. The database looks up the offset in the hash index, then reads the value from disk at that offset.
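A rough sketch of that idea, assuming the same comma-separated record format as the log example above (class and method names are mine):

```python
class HashIndexedLog:
    """Append-only log plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # key -> byte offset of the key's latest record
        open(path, "ab").close()  # make sure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()     # append mode positions us at the end of the file
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None           # key was never written
        with open(self.path, "rb") as f:
            f.seek(offset)        # jump straight to the record on disk
            line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```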
- Segment Files: what you get when you break the log up into a set of files of bounded size. Because a segment is never modified once it has been written, segments can be merged and compacted in the background
- Compaction: throwing away duplicate keys in a segment and keeping only the most recent value for each key. You do this to decrease the amount of space your database takes up
- Tombstones: deletion records for an append-only log file; when segments are merged, a tombstone tells the merge to discard any earlier values for that key
There are a few ways we can make log files more resilient and efficient:
- it saves space to store log files in a binary format
- if you store a snapshot of each segment’s hash map on disk, you can recover after a crash more quickly
- use checksums to detect partially written (corrupted) records
Append-only design is nice because it’s so simple. Writes are fast sequential appends, crash recovery is straightforward, and you can merge segments easily because you don’t have to worry about them being updated while you’re merging them.
On the other hand, append-only design with hash indexes has some limitations. The hash table must fit in memory, so this design is best when you have lots of writes to a few keys. Also, range queries are inefficient because the keys aren’t kept in sorted order.
SSTables & LSM-Trees
Sorted String Table (SSTable): a segment file whose records are sorted by key. A Log-Structured Merge-Tree (LSM-Tree) is a storage engine built on SSTables. To keep everything sorted, you need the following:
- An in-memory memtable
- The memtable is a sorted in-memory tree (e.g. a red-black tree), so key/value pairs can be inserted in any order and read back in key order
- When the memtable gets too big, you write it out to disk as a sorted segment file (an SSTable)
- On-disk Segment files
- You merge and compact these segment files from time to time
- when the same key appears in more than one segment being merged, keep only the value from the most recent segment
- An on-disk log file
- write to this log file every time you insert a value into the memtable
- this protects against crashes: the log can rebuild the memtable on restart
To read from this type of database, you search from newest to oldest. Start with the memtable, then move through the on-disk segments from most recent to oldest. A toy sketch of this write/flush/read cycle follows.
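In this sketch, plain Python dicts and lists stand in for the real on-disk structures; all names are mine, and a real memtable would be a balanced tree or skiplist rather than a dict:

```python
MEMTABLE_LIMIT = 4  # tiny flush threshold, just for illustration

class ToyLSMTree:
    def __init__(self):
        self.memtable = {}   # in-memory table; a real one is a sorted tree/skiplist
        self.segments = []   # stand-ins for on-disk SSTables, newest last
        self.log = []        # stand-in for the on-disk log guarding the memtable

    def put(self, key, value):
        self.log.append((key, value))   # append to the log first, for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # write the memtable out as a segment sorted by key (an SSTable)
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.log = []                   # flushed data is durable, so reset the log

    def get(self, key):
        # search newest to oldest: memtable first, then segments in reverse order
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None
```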
There are a few ways to optimize LSM-Trees:
- Bloom filters: memory-efficient data structures that approximate the contents of a set. Use one to tell that a key is definitely absent without touching the segment files (sketch after this list)
- Size-tiered compaction: Newer SSTables are smaller, and they get bigger as they get older. These older SSTables get accessed less often, so merging is less disruptive
- Because segments are sorted, you can also infer the location of a key’s byte offset from the locations of keys around it, so you only need to keep a sparse set of keys in memory
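A minimal Bloom filter sketch for the first bullet above (the bit-array size and hash choice are arbitrary): a "no" answer means the key is definitely absent, so the segment files can be skipped, while a "yes" only means "maybe".

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # derive several bit positions from hashes of the key
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # all bits set -> "maybe present"; any bit clear -> definitely absent
        return all(self.bits[pos] for pos in self._positions(key))
```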
LSM Trees have faster writes and better compression than B-Trees.
- BUT background compaction competes with live reads and writes for disk bandwidth, so a query that coincides with a big compaction can be slow
- on large databases, compaction can also fall behind the incoming write rate if it isn’t configured and monitored carefully
B-Trees
The standard indexing structure in almost all relational databases (and many non-relational databases)
- information is stored as key/value pairs, sorted by key and grouped into pages
- to find the value of a key, you traverse the tree
- to add a key, you find the page whose key range contains it. If there’s room, add it to that page. Otherwise, split the page into two half-full pages, add the new key, and update the parent of the original page to point to the two new pages (see the sketch after this list)
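A toy in-memory B-Tree illustrating the traversal and page-split steps above; this is a sketch, not how a real engine lays pages out on disk (all names are mine, and ORDER stands in for the branching factor):

```python
class Page:
    def __init__(self, leaf=True):
        self.keys = []       # sorted keys stored in this page
        self.values = []     # value for each key
        self.children = []   # child pages; empty for leaf pages
        self.leaf = leaf

class BTree:
    ORDER = 4  # max children per page; a page is "full" at ORDER - 1 keys

    def __init__(self):
        self.root = Page(leaf=True)

    def get(self, key):
        node = self.root
        while True:                       # traverse from the root down
            i = 0
            while i < len(node.keys) and key > node.keys[i]:
                i += 1
            if i < len(node.keys) and node.keys[i] == key:
                return node.values[i]
            if node.leaf:
                return None
            node = node.children[i]       # follow the reference to the child page

    def insert(self, key, value):
        if len(self.root.keys) == self.ORDER - 1:   # root page is full: split it first
            old_root, self.root = self.root, Page(leaf=False)
            self.root.children.append(old_root)
            self._split_child(self.root, 0)
        self._insert_nonfull(self.root, key, value)

    def _split_child(self, parent, i):
        # split a full child into two half-full pages and push the middle key
        # up into the parent, which then points at both halves
        full = parent.children[i]
        mid = len(full.keys) // 2
        right = Page(leaf=full.leaf)
        right.keys, right.values = full.keys[mid + 1:], full.values[mid + 1:]
        if not full.leaf:
            right.children = full.children[mid + 1:]
            full.children = full.children[:mid + 1]
        parent.keys.insert(i, full.keys[mid])
        parent.values.insert(i, full.values[mid])
        parent.children.insert(i + 1, right)
        full.keys, full.values = full.keys[:mid], full.values[:mid]

    def _insert_nonfull(self, node, key, value):
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and node.keys[i] == key:
            node.values[i] = value                  # key exists: overwrite the value
            return
        if node.leaf:
            node.keys.insert(i, key)                # room in the leaf page: just add it
            node.values.insert(i, value)
            return
        if len(node.children[i].keys) == self.ORDER - 1:
            self._split_child(node, i)              # make room before descending
            if key > node.keys[i]:
                i += 1
            elif key == node.keys[i]:
                node.values[i] = value
                return
        self._insert_nonfull(node.children[i], key, value)
```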
Important things to know about B-Trees:
- pages (blocks): chunks of data (usually around 4 KB). These act as the nodes of the tree
- the root: entry point for tree traversal. Contains references to other pages in the tree
- leaf pages: endpoint of a tree traversal. Contains values (or at least pointers to values)
- branching factor: the number of child references per page. Together with the number of keys, this determines the tree’s depth; a B-tree with n keys has a depth of O(log n)
- B-Trees have faster reads than LSM-Trees, but slower writes. This is because of write amplification: the structure of B-Trees means a single update often has to write several pages (plus the WAL), so one logical write becomes multiple physical writes.
Making B-Trees Reliable, Optimized
A Write-Ahead Log (WAL, redo log): an additional append-only file on disk
- every time you modify the B-Tree, you append the change to the WAL first, before applying it to the tree’s pages
- protects against crashes. You can use the WAL to reconstruct updates
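A minimal WAL sketch (the record format and names are my own): the change is appended and flushed before any page is modified, and replay() returns the logged changes so they can be re-applied after a crash.

```python
import json

class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def log(self, op, key, value=None):
        # the record reaches disk before the corresponding B-Tree page is touched
        with open(self.path, "a") as f:
            f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")
            f.flush()

    def replay(self):
        # after a crash, re-apply every logged change to bring the tree back
        # to a consistent state
        try:
            with open(self.path) as f:
                return [json.loads(line) for line in f]
        except FileNotFoundError:
            return []
```

An update would then be wal.log("put", key, value) followed by the actual page write, and on restart the replayed records are applied to whatever pages survived.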
You can use a Copy-on-write Scheme as an alternative to a WAL
- Instead of modifying a page in place, write a copy of the page (with the new key/value pair applied) to a new location
- Then update the parent page’s reference to point at the new copy
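The copy-on-write idea in miniature, with dicts standing in for pages:

```python
def cow_insert(page, key, value):
    new_page = dict(page)    # copy the original page rather than modifying it
    new_page[key] = value    # apply the insert to the copy
    return new_page          # the parent is then updated to point at new_page

old_page = {"apple": 1, "cherry": 3}
new_page = cow_insert(old_page, "banana", 2)
# old_page is untouched, so readers still holding it see a consistent snapshot
```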
You can also abbreviate keys to increase the branching factor.
It’s also good to keep pages that are adjacent in key order close to each other on disk. This decreases the seek overhead when you jump from one page to the next while traversing the tree.
Other Indexing Structures
Secondary indexes (as opposed to primary-key indexes) are often used to perform joins efficiently
- you have to get around the fact that secondary index keys are not unique. You do this either by making each index entry a list of matching row identifiers, or by appending a row identifier to each key to make it unique
- you can represent secondary indexes using B-Trees or LSM trees
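A sketch of the row-identifier approach with a toy table (the column and values are made up):

```python
from collections import defaultdict

rows = {                                    # primary key -> row (the heap, in effect)
    1: {"id": 1, "color": "red"},
    2: {"id": 2, "color": "blue"},
    3: {"id": 3, "color": "red"},
}

color_index = defaultdict(list)             # secondary index on the "color" column
for row_id, row in rows.items():
    color_index[row["color"]].append(row_id)  # non-unique key -> list of row ids

# color_index["red"] == [1, 3]; a query fetches the matching rows by identifier
matching = [rows[row_id] for row_id in color_index["red"]]
```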
Storing values in the index
- Clustered Index: an index that stores an entire row. These make for faster reads, but require more storage and write overhead
- Nonclustered Index: an index that stores references to rows. These require less storage and are faster to write to, but a query has to jump to the heap file, which takes more time
- Heap File: Place where the rows are stored. Often unordered. References in a nonclustered index point to the heap
- Covering Index: (AKA Index with included columns) a compromise between a clustered and nonclustered index. It stores some of the columns in a row, but not all
Multi-Column Indexes
- Concatenated Index: most common multi-column index, and it just combines columns in a specific order
- R-Trees: specialized spatial indexes. One alternative to R-Trees is to translate a 2-D location into a single number and then index that number with a regular B-Tree
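One common way to do that translation (my assumption; the notes don’t prescribe one) is to interleave the bits of the two coordinates, a Z-order curve, so nearby points tend to get nearby index values:

```python
def interleave_bits(x, y, bits=16):
    # even bit positions come from x, odd positions from y
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# e.g. latitude/longitude scaled to integers, then indexed like any other key
print(interleave_bits(3, 5))  # 39
```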
Fuzzy Indexes
These allow for misspellings and similar keys when you don’t have the exact key, e.g. searching for words within some edit distance of the query.
- Full-text search engines (like Lucene) do this. Lucene keeps its term dictionary in SSTable-like files, and its in-memory index over them is a trie-like automaton that can be searched for words within a given edit distance
In-memory Databases
Using disk for databases is a little awkward (because you have to lay data out in a disk-friendly format), and the cost per gigabyte of RAM keeps falling, so in-memory databases are a thing
- advantage: they can offer data structures that are hard to implement on disk (Redis offers priority queues and sets, for example)
- to keep in mind: they must still protect against power outages (e.g. by writing a log of changes to disk, taking periodic snapshots, or replicating to other machines)
Data Warehousing
- Online Transaction Processing (OLTP): exists for daily transactions, like item sales or inventory updates
- reads and writes happen on close to a human scale
- used in real time, so they should be responsive
- use cases usually include reading/writing one row at a time
- Online Analytics Processing (OLAP):
- used for business intelligence
- users generally want to aggregate a lot of data at once
Because OLAP queries might interfere with OLTP performance, you create a separate database for them. These separate databases are data warehouses. Data warehouses are generally read-only, and the data initially comes from the OLTP database via Extract-Transform-Load (ETL):
- extract (you extract data from the OLTP DB. Either batch it or stream it)
- transform (transform the data you grabbed to an analysis-friendly schema)
- load (load the data to the warehouse)
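A minimal ETL sketch using in-memory SQLite; the table and column names are invented for illustration:

```python
import sqlite3

# Hypothetical schemas: a "sales" table in the OLTP database and a
# "fact_sales" table in the warehouse (both in-memory here, for illustration)
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE sales (sku TEXT, quantity INTEGER, sold_at TEXT)")
oltp.execute("INSERT INTO sales VALUES ('SKU-1', 2, '2024-01-05T10:30:00')")

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_sales (sku TEXT, quantity INTEGER, sale_date TEXT)")

# Extract: pull raw rows out of the operational database
rows = oltp.execute("SELECT sku, quantity, sold_at FROM sales").fetchall()

# Transform: reshape into an analysis-friendly schema (keep just the date part)
facts = [(sku, qty, sold_at[:10]) for sku, qty, sold_at in rows]

# Load: bulk-insert the transformed facts into the warehouse's fact table
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", facts)
warehouse.commit()
```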
Once it’s in a data warehouse, analysts run computationally heavy queries without interfering with operations. Warehouses are usually relational
Star Schema
A data warehouse setup with a central fact table (each row usually represents an event you’re interested in) surrounded by dimension tables (which act as picklists for the who/what/where/when of each event). If you map this out, it looks like a star.
One variation on the star schema is the snowflake schema, which is a star schema where the dimension tables are further broken down into subdimensions
Column Oriented Storage
Warehouse databases are gigantic, so we have to find a faster way to query them. Thankfully, the typical analytical query only touches a few columns, so we can use that. If each column is stored in its own file, a query only needs to read the files for the columns it actually uses
- column-based storage advantage: helps reduce the bottleneck of moving things from disk to memory
Column Compression
Because the data within a column is often repetitive, it’s a good candidate for compression.
- Bitmap encoding: a compression technique where a column with n distinct values becomes n bitmaps, one per distinct value, with one bit per row (1 if that row holds the value, 0 if not). Sparse bitmaps can additionally be run-length encoded
- representing data in this way makes it possible to use bitwise operations, which makes queries faster
- Vectorized processing: executing operators (like those bitwise operations) on whole chunks of column values at a time in a tight loop, instead of row by row, which processes data quickly
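A toy bitmap-encoding example for the list above (plain Python lists stand in for the packed, run-length-encoded bitmaps a real engine would use):

```python
# One bitmap per distinct value, with a 1 in position n when row n holds that value
column = ["red", "blue", "red", "green", "red", "blue"]

bitmaps = {}
for row_number, value in enumerate(column):
    bitmaps.setdefault(value, [0] * len(column))[row_number] = 1

# bitmaps["red"] == [1, 0, 1, 0, 1, 0]

# A query like WHERE color IN ('red', 'blue') becomes a bitwise OR of two bitmaps
red_or_blue = [r | b for r, b in zip(bitmaps["red"], bitmaps["blue"])]
matching_rows = [row for row, bit in enumerate(red_or_blue) if bit]
# matching_rows == [0, 1, 2, 4, 5]
```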
Sorting Columns
Now that the columns are each in their own files, we can sort them, with the constraint that we don’t scramble the rows: the nth entry of one column file must correspond to the nth entry of every other column file, so all columns have to be sorted by the same row ordering
- primary sort column: if you know which column you query most often (like a SKU), that’s the one you sort by first. Sorting also makes that column highly compressible (long runs of the same value), which makes it faster to query
- secondary sort column: because sorted values in the primary column are often repetitive (like several instances of the same SKU being sold), you can sort the next-most-queried column within each run of identical primary values.
- a drawback to this is that you can’t insert new rows efficiently into sorted column files. These databases buffer writes LSM-style and merge them into the column files in the background
- neat thing: C-Store (commercialized as Vertica) stores the same data sorted in several different ways. The data has to be replicated across machines anyway, so each copy might as well use a different sort order, and a query can use whichever copy fits it best
Views and Cubes
Another way to get around the huge amount of raw data is to precompute aggregates. If, for instance, you precompute total sales per day, you can use the results to power more complex calculations
- Materialized Views: a cache of query results stored as an actual table. A drawback is that they need to be updated every time the underlying data changes (write overhead).
- Virtual Views: A shortcut for writing queries
- Data Cubes: a grid of aggregates grouped by dimension. This makes a lot of queries faster, but it can’t serve queries that need the raw detail; those edge cases just use the raw data
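A toy two-dimensional data cube (dates by products), with invented numbers:

```python
from collections import defaultdict

# Each cell of the grid holds one aggregate: total sales for a (date, product) pair
sales = [
    ("2024-01-01", "widget", 5),
    ("2024-01-01", "gadget", 3),
    ("2024-01-02", "widget", 7),
]

cube = defaultdict(int)
for date, product, amount in sales:
    cube[(date, product)] += amount

# Summing along one dimension answers "total sales per day" without the raw rows
sales_per_day = defaultdict(int)
for (date, _product), total in cube.items():
    sales_per_day[date] += total
# sales_per_day == {"2024-01-01": 8, "2024-01-02": 7}
```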