source: Jay Kreps’ blog post, “The Log: What every software engineer should know about real-time data’s unifying abstraction”


A lot of the information discussed in this article is relevant to Chapter 11 of Designing Data-Intensive Applications. My notes on this are here

What’s a Log?

it’s an append-only, totally-ordered sequence of records

  • ordered by time
  • each entry has a unique, sequential ID that can act as a timestamp, decoupling the log from any physical clock
  • aka write-ahead logs, commit logs, transaction logs

The point of a log is to record what happened and when

  • extremely important for distributed data systems

it’s important for machines to play back their history at their own rate in a deterministic manner
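
A minimal sketch of a log as a data structure (my own illustration, not from the post) - the append offset doubles as the entry’s unique ID / logical timestamp:

```python
class Log:
    """An append-only, totally-ordered sequence of records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end; the offset is its unique ID."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Replay history from an offset, at the reader's own pace."""
        return self._records[offset:]


log = Log()
log.append({"event": "user_created", "user": "alice"})
log.append({"event": "user_renamed", "user": "alice", "name": "Alice"})
for entry in log.read_from(0):  # deterministic playback
    print(entry)
```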

Logs in databases

Databases use logs to keep their data structures and indexes in sync. Traditionally, the log was purely an internal implementation detail of the database

  • write-ahead log (notes from DDIA here) helps the DB recover after a crash
  • now logs are also used to replicate data between databases (notes from DDIA here)

Logs in distributed systems

For distributed systems, the order of inputs is extremely important

  • Logs help these systems agree on an ordering of events
  • an event’s log ID becomes a universal timestamp

State Machine Replication Principle - identical deterministic processes that consume the same inputs in the same order will produce the same output and end in the same state
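
A toy demonstration of the principle (a sketch of my own):

```python
def apply_op(state, op):
    """A deterministic transition function: same op, same effect."""
    kind, key, value = op
    if kind == "set":
        state[key] = value
    elif kind == "incr":
        state[key] = state.get(key, 0) + value

# The shared log of inputs, in a fixed order.
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 2)]

# Two identical replicas consume the same log independently...
replica_a, replica_b = {}, {}
for op in log:
    apply_op(replica_a, op)
for op in log:
    apply_op(replica_b, op)

# ...and end in the same state.
assert replica_a == replica_b == {"x": 5, "y": 2}
```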

physical vs logical logging - relevant for DB people

  • physical: logging the contents of each row that was changed
  • logical: logging the SQL commands that caused the changes
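
For example (an illustrative sketch, not a real log format):

```python
# The same update, logged two ways:
# UPDATE accounts SET balance = balance + 100 WHERE id = 42;

# logical: record the command that causes the change
logical_entry = "UPDATE accounts SET balance = balance + 100 WHERE id = 42"

# physical: record the changed row's contents after the change
physical_entry = {"table": "accounts", "id": 42, "balance": 1100}
```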

state machine model - each replica processes every request directly from the log (active-active)

primary-backup model - elect a leader to process requests and log out the resulting state changes, which followers apply

A distributed log can be seen as the data structure which models the problem of consensus

Changelog 101: Tables and Events are Dual

a log is like a list of bank transactions; a table just shows the current account balance

  • if you have a log of changes, you can apply them in order to materialize the table
  • if you have a table taking updates, you can record those changes into a changelog
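
A sketch of the duality (my own illustration):

```python
# A changelog of account updates (the "log" side of the duality).
changelog = [("alice", +100), ("bob", +50), ("alice", -30)]

# Log -> table: apply the changes in order to get current balances.
table = {}
for account, delta in changelog:
    table[account] = table.get(account, 0) + delta
assert table == {"alice": 70, "bob": 50}

# Table -> log: recording each update as it happens recreates a changelog.
recorded = []

def update(account, delta):
    table[account] = table.get(account, 0) + delta
    recorded.append((account, delta))  # publish the change
```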

Version control is also like this: the sequence of patches is the log, the checked-out working copy is the table

Data Integration

data integration - making sure your services have access to the data they need

  • aka ETL, which is usually limited to data integration in relational data warehouses
  • this is a mundane but essential problem for organizations

Not only does this data need to be available, but it needs to be readable

Complications to data integration

  1. There’s a lot of event data - way more than we can store in a database
  2. there are a lot of specialized systems (for search, batch processing, …)

Log-structured data flow

Disadvantages of depending on data warehouses

  • processing isn’t reversible and is specific to one problem
  • if data isn’t entered correctly right away, the warehouse ends up with bad data

solution: all your organization’s data goes into a central log (each logical data source gets its own log). Advantages:

  • You have a buffer for async data consumption
  • destination systems don’t know anything about the data’s origin
  • all your systems potentially have access to all of the data
  • introducing a new system is straightforward

This is slightly different from a messaging system or pub/sub: the log is durable and can be replayed by many independent subscribers

Databus - LinkedIn infrastructure that provides a caching log abstraction on top of database tables

technologies that do this:

  • Kafka
  • Kinesis
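
A minimal sketch with the kafka-python client, assuming a broker at localhost:9092 (the topic name user_events is my own):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Each logical data source publishes to its own log (a Kafka topic).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user_events", {"event": "page_view", "user": "alice"})
producer.flush()

# Any number of downstream systems can subscribe at their own pace.
consumer = KafkaConsumer(
    "user_events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay from the start of the log
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:  # blocks, consuming forever
    print(message.offset, message.value)
```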

Relationship to ETL and the Data Warehouse

A data warehouse is (supposed to be) a repository of clean, integrated data to support analysis

  • this is great for batch jobs
  • sometimes we need clean integrated data for things other than batch jobs

During ETL, the data is:

  1. cleaned up
  2. structured for the data warehouse

One problem of the data warehouse is organizational: The data warehouse team is the only party responsible for getting data cleaned up and into the warehouse

  • there’s no incentive for other teams to make this easy

Alternative: if a log acts as a central pipeline with a well-defined API for adding data, the responsibility for providing data in a readable format falls to individual teams

  • this way, the data warehouse team focuses on the problem of loading the (already structured) data into the warehouse
  • teams clean their data before they put it on the pipeline. This cleanup should be lossless and reversible
  • you can enrich data asynchronously
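
For instance, a producing team might wrap its data in an agreed, well-defined envelope before publishing (a hypothetical schema of my own):

```python
import json
import time
import uuid

def make_event(event_type, payload):
    """Wrap cleaned data in the pipeline's agreed envelope."""
    return json.dumps({
        "id": str(uuid.uuid4()),   # unique event ID
        "type": event_type,        # e.g. "profile_updated"
        "timestamp": time.time(),
        "payload": payload,        # cleaned but lossless: nothing dropped
    })

event = make_event("profile_updated", {"user": "alice", "field": "title"})
```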

The advantage: it becomes easy to add data systems besides warehouses

This architecture enables decoupled, event-driven systems

Building a Scalable Log

LinkedIn handles tens of billions of log messages every day. How do they do it?

  1. partition the log
    • Each partition is a totally ordered log, but there’s no global ordering between partitions
    • this allows systems to append to logs without coordinating
  2. batch reads and writes
    • Kafka does this aggressively
  3. avoid needless data copying
    • Kafka uses zero-copy data transfer
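
A sketch of partitioning (illustrative, not Kafka’s actual partitioner) - records with the same key land in the same partition, so per-key order is preserved without global coordination:

```python
import zlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def append(key, record):
    """Hash the key to pick a partition; each partition is its own ordered log."""
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append(record)
    return p, len(partitions[p]) - 1  # (partition, offset)

append("alice", {"event": "login"})
append("alice", {"event": "logout"})  # same key -> same partition, in order
append("bob", {"event": "login"})     # may land on a different partition
```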

Real-Time Data Processing

we can view stream processing as infrastructure for continuous data processing

  • this depends on how we collect the data. Is it collected continuously or as a batch?
  • most real-world use cases are continuous

Data flow graphs

stream processing lets us create feeds computed from other feeds

Advantages of using a log for stream processing: each dataset is multi-subscriber and ordered

Stateful Real-Time Processing

We need to maintain state in the processor to handle sophisticated counts, aggregations and joins. How do we maintain this state?

  • DDIA addresses this a little bit (notes)

We could keep state in memory, but this can be fragile. We could also keep the state in a remote machine, but many round-trip calls would be expensive

Alternative: The processor can journal out a changelog

  • it can just restore its index from the changelog after a crash
  • advantage: other processes can subscribe to this
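
A sketch of this pattern (my own illustration) - the processor keeps counts locally and journals every change, so a restarted instance rebuilds its state by replaying the changelog:

```python
changelog = []  # in practice, a log the processor writes to (e.g. a Kafka topic)

class CountingProcessor:
    def __init__(self):
        self.counts = {}

    def process(self, event):
        key = event["user"]
        self.counts[key] = self.counts.get(key, 0) + 1
        changelog.append((key, self.counts[key]))  # journal the new value

    def restore(self):
        """After a crash, rebuild local state from the changelog."""
        for key, value in changelog:
            self.counts[key] = value

p = CountingProcessor()
p.process({"user": "alice"})
p.process({"user": "alice"})

p2 = CountingProcessor()  # simulated restart on a fresh instance
p2.restore()
assert p2.counts == p.counts == {"alice": 2}
```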

Log compaction

we have a finite amount of space for logs

for event data, Kafka supports retaining a window of data

for keyed data, we can compact the log by removing obsolete records (any record superseded by a more recent update to the same key)
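
A sketch of compaction for keyed data (illustrative):

```python
log = [
    ("alice", {"title": "engineer"}),
    ("bob", {"title": "designer"}),
    ("alice", {"title": "manager"}),  # supersedes the earlier alice record
]

def compact(entries):
    """Keep only the most recent record for each key."""
    latest = {}
    for key, value in entries:
        latest[key] = value  # later entries overwrite earlier ones
    return list(latest.items())

assert compact(log) == [
    ("alice", {"title": "manager"}),
    ("bob", {"title": "designer"}),
]
```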

Distributed System Design

in a system built around a shared log, individual systems can offload hard problems (ordering, durability, replication) to that log

  • it allows them to reduce complexity

The log captures the state changes; the serving nodes just need to store the right index and subscribe to the log

this separation makes it easy to restore failed nodes or move partitions

  • use a combination of snapshots and a fixed window of data
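
A sketch of restoring a node from a snapshot plus the tail of the log (names are my own):

```python
def apply_entry(state, entry):
    key, value = entry
    state[key] = value

def restore(snapshot, log):
    """Load the snapshot, then replay only the log entries after it."""
    state, snapshot_offset = dict(snapshot[0]), snapshot[1]
    for entry in log[snapshot_offset:]:
        apply_entry(state, entry)
    return state

log = [("x", 1), ("x", 2), ("y", 9)]
snapshot = ({"x": 2}, 2)  # state with entries 0 and 1 already applied

assert restore(snapshot, log) == {"x": 2, "y": 9}
```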