source: Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, Todd Palino (Chapter 1)

Publish/Subscribe Messaging

message - a base unit of information. Often has a category
publisher - sends and classifies messages, but does not direct them anywhere
subscriber - receives certain classes of messages
broker - central point where messages are published to

Enter Kafka

a “distributed commit log”, or a “distributed streaming platform”

Designed to store data durably and in order. This means that it can be read deterministically

Another key feature of Kafka is that the data is distributed

Messages and Batches

message - unit of data within Kafka. Kafka has no knowledge about the content of the message

key - an optional component of a message. Used to control how messages are written to partition

batch - collection of messages being produced to the same topic and partition

larger batches lead to messages being handled more efficiently, but individual messages take longer to propagate

Schemas

Optional, but recommended. Allow engineers to decouple writing and reading messages

human-readable schemas (JSON, XML) lack type handling and schema versions
alternatives, like Apache Avro, have compact serialization formats, type handling and separate schemas

Topics & Partitions

topics - how messages in Kafka are organized. Can be broken down into partitions

partitions - a single log that messages are appended to

can be hosted on different servers
time ordering is guaranteed within a partition but not a topic

stream - represent data moving from the producers to consumers for a given topic

Producers & Consumers

Kafka clients can be users or consumers

Kafka Connect API - advanced client API for data integration
Kafka Streams - for stream processing

producers

create new messages. Generally publish messages to specific topics and usually don’t care which partition the message goes to

consumers

consume messages. Can be subscribed to one or more topics, reads messages in order

offset - metadata (a counter) given to each message within a partition

a consumer will use this to keep track of which messages it’s already consumed

consumer group - a set of consumers reading from a topic. Each partition is read by a single consumer

ownership - mapping of a consumer to a partition

Brokers & Clusters

A broker is a single Kafka server

it receives messages from producers, assigns offsets to them, and commits them to disk
services fetch requests from consumers
owns a partition

A cluster is a group of brokers

A controller is responsible for administrative operations

A broker that is the leader of a partition is responsible for receiving messages from producers and for replicating the partition to other brokers in its cluster

retention - durable storage of messages

can be for a period of time or dictated by a data limit
expired messages are deleted

Multiple Clusters

You might want multiple clusters for

data segregation
security isolation
redundancy

Mirror Maker - like a mega-kafka. Messages are consumed from one cluster and produced for another

Why Kafka?

multiple producers and consumers
data is easy to retain
scalable and performant

Use Cases for Kafka

activity tracking
messaging
metrics & logging
stream processing

Kafka’s Origin

created at LinkedIn to address a data pipeline problem

released as an open source project on GitHub in 2010

It’s named after Franz Kafka, but for no particular reason. Just because it sounded cool