Notes: Logs
source: Jay Kreps’ blog post
A lot of the information discussed in this article is relevant to Chapter 11 of Designing Data-Intensive Applications. My notes on this are here
What’s a Log?
it’s an append-only, totally-ordered sequence of records
- ordered by time
- each entry has a unique, sequential ID that can serve as a logical timestamp - this decouples ordering from any physical clock
- aka write-ahead logs, commit logs, transaction logs
The point of a log is to record what happened and when
- extremely important for distributed data systems
it’s important that machines can play back this history at their own rate and, because processing is deterministic, still arrive at the same state
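A minimal sketch of such a log in Python (the class and method names here are mine, purely for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    offset: int    # unique, monotonically increasing ID; doubles as a logical timestamp
    payload: str

@dataclass
class Log:
    records: list = field(default_factory=list)

    def append(self, payload: str) -> int:
        """Append-only: existing records are never modified or reordered."""
        offset = len(self.records)
        self.records.append(Record(offset, payload))
        return offset

    def read_from(self, offset: int) -> list:
        """Readers remember their own offset and replay the log at their own pace."""
        return self.records[offset:]
```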
Logs in databases
Databases use logs to keep their data structures and indexes in sync. Traditionally the log is an implementation detail, used only internally by the database
- write-ahead log (notes from DDIA here) helps the DB recover after a crash
- Now it’s used as a way to replicate data between databases (notes from DDIA here)
Logs in distributed systems
For distributed systems, the order of inputs is extremely important
- Logs help these systems agree on an ordering of events
- an event’s log ID becomes a universal timestamp
State Machine Replication Principle - identical deterministic processes that start in the same state and get the same inputs in the same order will produce the same output and end in the same state
physical vs logical logging - relevant for DB people
- physical: log the contents of each row that was changed
- logical: log the SQL commands
state machine model: each replica applies the log of incoming requests directly
primary-backup model: a leader processes requests and logs the resulting state changes; followers apply those changes
A distributed log can be seen as the data structure which models the problem of consensus
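A toy illustration of the principle (my own example, not from the post): two replicas that apply the same deterministic operations in the same log order end up in the same state.

```python
def apply(state: dict, op: tuple) -> None:
    """A deterministic transition function: each op sets a key to a value."""
    key, value = op
    state[key] = value

# The shared log: a totally ordered sequence of inputs.
log = [("x", 1), ("y", 2), ("x", 3)]

# Two replicas start from the same (empty) state and apply the log in order...
replica_a, replica_b = {}, {}
for op in log:
    apply(replica_a, op)
    apply(replica_b, op)

# ...so they end up identical, no matter how fast each one replays.
assert replica_a == replica_b == {"x": 3, "y": 2}
```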
Changelog 101: Tables and Events are Dual
a log is like the list of transactions on a bank account; a table shows just the current balance
- if you have a log of changes, you can apply them to build a table of the current state
- if you have a table and record every change made to it (a changelog), you can reconstruct the log
Version control works the same way: the sequence of patches is the log, the checked-out working copy is the table
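A small sketch of the duality using account balances (a made-up toy, not from the post):

```python
# Log -> table: replay the transaction log to materialize current balances.
transactions = [("alice", +100), ("bob", +50), ("alice", -30)]

balances = {}
for account, amount in transactions:
    balances[account] = balances.get(account, 0) + amount
assert balances == {"alice": 70, "bob": 50}

# Table -> log: record every change applied to the table as a changelog entry.
changelog = []

def deposit(account, amount):
    balances[account] = balances.get(account, 0) + amount
    changelog.append((account, amount))

deposit("carol", 20)
assert changelog == [("carol", 20)]
```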
Data Integration
data integration: making sure all of an organization's data is available to every service and system that needs it
- related to ETL, though ETL is usually limited to data integration into relational data warehouses
- this is a mundane but essential problem for organizations
Not only does this data need to be available, but it needs to be readable
Complications to data integration
- There’s a lot of event data - way more than what we can store in a database
- there are a lot of specialized systems (for search, batch processing, …)
Log-structured data flow
Disadvantages of depending on data warehouses
- processing isn’t reversible and is specific to one problem
- if data isn’t entered correctly right away, the data warehouse has bad data
solution: all of your organization’s data goes into a central log (each logical data source gets its own log). Advantages:
- You have a buffer for async data consumption
- destination systems don’t know anything about the data’s origin
- all your systems potentially have access to all of the data
- introducing a new system is straightforward
This is slightly different from a messaging system or pub/sub: the log is persistent and strongly ordered, and each consumer tracks its own position in it
databus: infrastructure that provides a log-caching abstraction
technologies that do this:
- Kafka
- Kinesis
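A minimal sketch of the pattern with the kafka-python client, assuming a broker at localhost:9092; the topic and consumer-group names are made up:

```python
from kafka import KafkaProducer, KafkaConsumer

# The source system publishes events to its own log (topic) and knows
# nothing about who will consume them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-activity", b'{"user": "alice", "action": "login"}')
producer.flush()

# Destination systems subscribe independently, each reading at its own
# pace from its own offset; adding a new system is just a new group_id.
consumer = KafkaConsumer(
    "user-activity",
    bootstrap_servers="localhost:9092",
    group_id="search-indexer",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```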
Relationship to ETL and the Data Warehouse
A data warehouse is (supposed to be) a repository of clean, integrated data to support analysis
- this is great for batch jobs
- sometimes we need clean integrated data for things other than batch jobs
During ETL, the data is:
- cleaned up
- structured for the data warehouse
One problem of the data warehouse is organizational: The data warehouse team is the only party responsible for getting data cleaned up and into the warehouse
- there’s no incentive for other teams to make this easy
Alternative: if a log is acting as a central pipeline, and it has a well defined API for adding data, the responsibility for providing data in a readable format falls to individual teams
- this way, the data warehouse team focuses on the problem of loading the (already structured) data into the warehouse
- teams clean their data before they put it on the pipeline. This cleanup should be lossless and reversible
- you can enrich data asynchronously
Advantage to this: it becomes easy to add data systems besides warehouses
This architecture enables decoupled, event-driven systems
Building a Scalable Log
LinkedIn handles tens of billions of log writes every day. How do they do it?
- partition the log
- Each partition is a totally ordered log, but there’s no global ordering between partitions
- this allows systems to append to logs without coordinating
- batch reads and writes
- Kafka does this aggressively
- de-dup
- avoid needless data copies: Kafka uses zero-copy data transfer
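A toy sketch of the partitioning idea (hash function and partition count are arbitrary): records with the same key always go to the same partition, so per-key ordering holds without any global coordination.

```python
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Hash the key so all records for one key land in the same partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def append(key: str, value: str) -> None:
    # Each append touches exactly one partition, so writers never coordinate.
    partitions[partition_for(key)].append((key, value))

append("alice", "login")
append("bob", "search")
append("alice", "logout")   # ordered after "login" within alice's partition
```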
Real Time Data Processing
we can view stream processing as infrastructure for continuous data processing
- this depends on how we collect the data. Is it collected continuously or as a batch?
- most real-world use cases are continuous
Data flow graphs
stream processing lets us create feeds computed from other feeds
Advantages for using a log for stream processing: each dataset is multi-subscriber and ordered
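A toy example of a feed computed from another feed (made up for illustration): the derived log is itself multi-subscriber, so further jobs can build on it.

```python
# Input feed: raw page views. Output feed: page views with crawler traffic removed.
raw_page_views = [("alice", "/home"), ("crawler-7", "/home"), ("bob", "/pricing")]

filtered_page_views = []                  # a new log downstream jobs can subscribe to
for user, page in raw_page_views:
    if not user.startswith("crawler-"):
        filtered_page_views.append((user, page))

assert filtered_page_views == [("alice", "/home"), ("bob", "/pricing")]
```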
Stateful Real-Time Processing
We need to maintain state in the processor to handle sophisticated counts, aggregations and joins. How do we maintain this state?
- DDIA addresses this a little bit (notes)
We could keep state in memory, but this can be fragile. We could also keep the state in a remote machine, but many round-trip calls would be expensive
Alternative: The processor can journal out a changelog
- it can just restore its index from the changelog after a crash
- advantage: other processes can subscribe to this
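A sketch of the idea (names are mine): the processor keeps counts in a local dict and journals every change, so after a crash it can rebuild its state by replaying the changelog, and other processes can subscribe to the same changelog.

```python
counts = {}      # local, in-memory state (fast, but lost on a crash)
changelog = []   # in practice this would be a changelog topic in the log system

def process(event_key: str) -> None:
    counts[event_key] = counts.get(event_key, 0) + 1
    changelog.append((event_key, counts[event_key]))   # journal the new value

def restore() -> dict:
    """After a crash, rebuild the state by replaying the changelog."""
    state = {}
    for key, value in changelog:
        state[key] = value          # later entries overwrite earlier ones
    return state

process("page_view")
process("page_view")
process("signup")
assert restore() == {"page_view": 2, "signup": 1}
```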
Log compaction
we have a finite amount of space for logs
for event data, Kafka supports retaining a window of data
for keyed data, we can remove obsolete records (records for a key that has a more recent update)
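A toy version of compaction for keyed data (mine, not Kafka’s actual algorithm): keep only the most recent record per key.

```python
def compact(log):
    """Keep only the latest record for each key, preserving relative order."""
    latest = {}                                     # key -> (offset, value)
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors]

log = [("a", 1), ("b", 2), ("a", 3)]
assert compact(log) == [("b", 2), ("a", 3)]         # ("a", 1) is obsolete
```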
Distributed System Design
in a system with a shared log, individual systems can rely on the shared log
- it allows them to reduce complexity
The log captures the state changes; the serving nodes just need to store the right index and subscribe to the log
this separation makes it easy to restore failed nodes or move partitions
- use a combination of snapshots and a fixed window of data
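A sketch of that restore path (names and structure are made up): start from the latest snapshot, then replay only the log entries newer than the snapshot.

```python
def restore_node(snapshot_state: dict, snapshot_offset: int, log: list) -> dict:
    """Rebuild a node's index from its last snapshot plus the log entries after it."""
    state = dict(snapshot_state)
    for offset, (key, value) in enumerate(log):
        if offset > snapshot_offset:                # only the tail of the log is needed
            state[key] = value
    return state

log = [("x", 1), ("y", 2), ("x", 3)]
snapshot_state, snapshot_offset = {"x": 1}, 0       # snapshot taken after offset 0
assert restore_node(snapshot_state, snapshot_offset, log) == {"x": 3, "y": 2}
```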