Notes: Kafka Introduction
source: Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, Todd Palino (Chapter 1)
Publish/Subscribe Messaging
- message - a base unit of information. Often has a category
- publisher - sends and classifies messages, but does not direct them anywhere
- subscriber - receives certain classes of messages
- broker - central point where messages are published
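A minimal sketch of the four roles above, assuming an in-memory broker. The `Broker` class and the "metrics" and "logs" categories are made up for illustration and are not Kafka's API:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Hypothetical in-memory broker that routes messages by category."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, category: str, callback: Callable[[str], None]) -> None:
        # A subscriber registers interest in one class of messages.
        self._subscribers[category].append(callback)

    def publish(self, category: str, message: str) -> None:
        # The publisher only classifies the message; the broker decides who gets it.
        for callback in self._subscribers[category]:
            callback(message)


received: List[str] = []
broker = Broker()
broker.subscribe("metrics", received.append)
broker.publish("metrics", "cpu=0.93")
broker.publish("logs", "disk full")  # nobody subscribed to "logs", so it is not delivered
print(received)  # ['cpu=0.93']
```

The publisher never names a receiver, which is the point of the pattern.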
Enter Kafka
a “distributed commit log”, or a “distributed streaming platform”
Designed to store data durably and in order. This means that it can be read deterministically
Another key feature of Kafka is that the data is distributed
Messages and Batches
message - unit of data within Kafka. Kafka has no knowledge about the content of the message
key - an optional component of a message. Used to control which partition a message is written to
batch - collection of messages being produced to the same topic and partition
- larger batches lead to messages being handled more efficiently, but individual messages take longer to propagate
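A rough sketch of how a key can pin messages to a partition and how messages are grouped into per-partition batches. It assumes a single topic with 3 partitions and uses `zlib.crc32` as a stand-in hash; `choose_partition` and `build_batches` are hypothetical helpers, not Kafka's partitioner:

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

NUM_PARTITIONS = 3  # assumed partition count for this illustration


def choose_partition(key: Optional[bytes], num_partitions: int, counter: int) -> int:
    """Same key -> same partition; keyless messages are spread round-robin."""
    if key is None:
        return counter % num_partitions
    # crc32 is only a stand-in for whatever hash a real partitioner uses.
    return zlib.crc32(key) % num_partitions


def build_batches(messages: List[Tuple[Optional[bytes], bytes]]) -> Dict[int, List[bytes]]:
    """Group messages bound for the same partition into one batch each."""
    batches: Dict[int, List[bytes]] = defaultdict(list)
    for i, (key, value) in enumerate(messages):
        batches[choose_partition(key, NUM_PARTITIONS, i)].append(value)
    return dict(batches)


messages = [
    (b"user-1", b"u1-login"),
    (b"user-2", b"u2-login"),
    (b"user-1", b"u1-click"),
    (b"user-1", b"u1-logout"),
]
batches = build_batches(messages)
for partition, batch in sorted(batches.items()):
    print(partition, batch)
```

All `user-1` messages end up in the same batch, in the order they were produced. Sending one batch per partition saves round trips, at the cost of waiting for the batch to fill.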
Schemas
Optional, but recommended. Allow engineers to decouple writing and reading messages
- human-readable formats (JSON, XML) lack robust type handling and compatibility between schema versions
- alternatives, like Apache Avro, have compact serialization formats, type handling and separate schemas
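A small sketch of why a schema decouples writers from readers: a reader with a newer schema can still decode an older message if every added field has a default. This is a hand-rolled stand-in over JSON, not Avro; `READER_SCHEMA_V2` and `decode` are hypothetical names:

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical reader schema: field name -> (expected type, default or None if required).
READER_SCHEMA_V2: Dict[str, Tuple[type, Any]] = {
    "user_id": (int, None),
    "action": (str, None),
    "region": (str, "unknown"),  # added in v2, so it carries a default
}


def decode(raw: bytes, schema: Dict[str, Tuple[type, Any]]) -> Dict[str, Any]:
    """Decode a JSON message, filling defaults and checking field types."""
    record = json.loads(raw)
    result: Dict[str, Any] = {}
    for field, (field_type, default) in schema.items():
        if field in record:
            value = record[field]
        elif default is not None:
            value = default  # the writer used an older schema without this field
        else:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(value, field_type):
            raise TypeError(f"{field} should be {field_type.__name__}")
        result[field] = value
    return result


# A message written before "region" existed still decodes under the v2 schema.
old_message = json.dumps({"user_id": 7, "action": "login"}).encode()
decoded = decode(old_message, READER_SCHEMA_V2)
print(decoded)  # {'user_id': 7, 'action': 'login', 'region': 'unknown'}
```

The writer and reader can then be upgraded independently.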
Topics & Partitions
topics - how messages in Kafka are organized. Can be broken down into partitions
partitions - a single log that messages are appended to
- can be hosted on different servers
- message time-ordering is guaranteed within a single partition, but not across the entire topic
stream - represents data moving from producers to consumers; usually means a single topic of data, regardless of the number of partitions
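A toy model of a topic as a set of append-only partition logs, assuming everything lives in memory. `Partition` and `Topic` are illustrative classes, not Kafka types:

```python
from typing import List


class Partition:
    """A single append-only log."""

    def __init__(self) -> None:
        self._log: List[bytes] = []

    def append(self, message: bytes) -> int:
        # Messages are only ever added at the end; return the position written.
        self._log.append(message)
        return len(self._log) - 1

    def read(self, start: int = 0) -> List[bytes]:
        # Reads go from an earlier position towards the end of the log.
        return self._log[start:]


class Topic:
    """A named collection of partitions; each could live on a different server."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]


topic = Topic("clicks", num_partitions=2)
topic.partitions[0].append(b"a1")
topic.partitions[1].append(b"b1")
position = topic.partitions[0].append(b"a2")

# Order is preserved inside partition 0, but nothing records whether
# b"b1" was written before or after b"a2" in the topic as a whole.
print(topic.partitions[0].read())  # [b'a1', b'a2']
```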
Producers & Consumers
Kafka clients are users of the system. The two basic types are producers and consumers
- Kafka Connect API - advanced client API for data integration
- Kafka Streams - for stream processing
producers
create new messages. Generally publish messages to specific topics and usually don’t care which partition the message goes to
consumers
read messages. Subscribe to one or more topics and read messages in the order they were produced
offset - metadata (a counter) given to each message within a partition
- a consumer will use this to keep track of which messages it’s already consumed
consumer group - one or more consumers that work together to consume a topic. Within a group, each partition is read by only one consumer
ownership - mapping of a consumer to a partition
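A sketch of ownership and offset tracking, assuming a simple round-robin assignment (not necessarily the strategy Kafka uses). `assign_partitions` and `Consumer` are hypothetical, and the partition log is a plain list:

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give every partition exactly one owner within the group."""
    ownership: Dict[str, List[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        ownership[consumers[i % len(consumers)]].append(partition)
    return ownership


class Consumer:
    """Remembers, per partition, the next offset it has not consumed yet."""

    def __init__(self) -> None:
        self.next_offset: Dict[int, int] = {}

    def poll(self, partition: int, log: List[str]) -> List[str]:
        start = self.next_offset.get(partition, 0)
        records = log[start:]
        # Advance the offset so the same messages are not returned twice.
        self.next_offset[partition] = start + len(records)
        return records


ownership = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])
print(ownership)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}

log = ["m0", "m1", "m2"]  # stand-in for partition 0
consumer = Consumer()
first = consumer.poll(0, log)
log.append("m3")
second = consumer.poll(0, log)
print(first, second)  # ['m0', 'm1', 'm2'] ['m3']
```

Because the offset is stored separately from the log, a consumer can stop and later resume where it left off.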
Brokers & Clusters
A broker is a single Kafka server
- it receives messages from producers, assigns offsets to them, and commits them to disk
- services fetch requests from consumers
- owns partitions: each partition is owned by a single broker in the cluster, called the leader of that partition
A cluster is a group of brokers
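One broker in the cluster is elected controller. It is responsible for administrative operations, such as assigning partitions to brokers and monitoring for broker failures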
A broker that is the leader of a partition is responsible for receiving messages from producers and for replicating the partition to other brokers in its cluster
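A simplified sketch of a partition leader assigning offsets and copying messages to other replicas. The push-style copy here is an assumption made for brevity (in Kafka, follower replicas fetch from the leader), and `Replica` and `PartitionLeader` are made-up classes:

```python
from typing import List


class Replica:
    """One broker's copy of a partition."""

    def __init__(self, broker_id: int) -> None:
        self.broker_id = broker_id
        self.log: List[bytes] = []


class PartitionLeader(Replica):
    """The replica that accepts writes from producers."""

    def __init__(self, broker_id: int, followers: List[Replica]) -> None:
        super().__init__(broker_id)
        self.followers = followers

    def produce(self, message: bytes) -> int:
        offset = len(self.log)  # the leader assigns the next offset
        self.log.append(message)
        for follower in self.followers:
            # Simplification: push the message instead of letting followers fetch it.
            follower.log.append(message)
        return offset


followers = [Replica(broker_id=2), Replica(broker_id=3)]
leader = PartitionLeader(broker_id=1, followers=followers)
offsets = [leader.produce(b"m0"), leader.produce(b"m1")]
print(offsets)  # [0, 1]
```

If the leader's broker fails, a follower that holds the same log can take over, which is where the redundancy comes from.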
retention - durable storage of messages for some period of time
- can be for a period of time or dictated by a data limit
- expired messages are deleted
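A sketch of the two retention rules, applied together for illustration. It works per message, which is a simplification: Kafka applies retention to log segments rather than individual messages. `apply_retention` and its parameters are hypothetical:

```python
from typing import List, Tuple

Message = Tuple[float, bytes]  # (timestamp in seconds, payload)


def apply_retention(log: List[Message], now: float, max_age_s: float, max_bytes: int) -> List[Message]:
    """Return the messages that survive both the time limit and the size limit."""
    # Time-based rule: drop anything older than max_age_s.
    kept = [m for m in log if now - m[0] <= max_age_s]
    # Size-based rule: drop the oldest messages until the log fits in max_bytes.
    total = sum(len(payload) for _, payload in kept)
    while kept and total > max_bytes:
        total -= len(kept[0][1])
        kept.pop(0)
    return kept


log = [(0.0, b"aaaa"), (50.0, b"bbbb"), (90.0, b"cccc"), (99.0, b"dddd")]
kept = apply_retention(log, now=100.0, max_age_s=60.0, max_bytes=8)
print(kept)  # [(90.0, b'cccc'), (99.0, b'dddd')]
```

Here the first message expires by age and the second is removed to meet the 8-byte limit.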
Multiple Clusters
You might want multiple clusters for
- data segregation
- security isolation
- redundancy
MirrorMaker - a tool for replicating data between clusters. Messages are consumed from one cluster and produced to another
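The consume-then-produce idea can be sketched with two dictionaries standing in for clusters. The `mirror` function and the `Cluster` alias are hypothetical and say nothing about MirrorMaker's actual configuration or API:

```python
from collections import deque
from typing import Deque, Dict, List

Cluster = Dict[str, List[bytes]]  # topic name -> messages (stand-in for a whole cluster)


def mirror(source: Cluster, target: Cluster, topic: str, consumed_so_far: int) -> int:
    """Copy not-yet-mirrored messages of one topic; return the new source position."""
    # Consumer side: read what has not been mirrored yet into a queue.
    queue: Deque[bytes] = deque(source.get(topic, [])[consumed_so_far:])
    # Producer side: drain the queue into the target cluster, preserving order.
    while queue:
        target.setdefault(topic, []).append(queue.popleft())
    return len(source.get(topic, []))


source: Cluster = {"metrics": [b"m1", b"m2"]}
target: Cluster = {}
position = mirror(source, target, "metrics", 0)
source["metrics"].append(b"m3")
position = mirror(source, target, "metrics", position)
print(target)  # {'metrics': [b'm1', b'm2', b'm3']}
```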
Why Kafka?
- multiple producers and consumers
- data is easy to retain
- scalable and performant
Use Cases for Kafka
- activity tracking
- messaging
- metrics & logging
- stream processing
Kafka’s Origin
created at LinkedIn to address a data pipeline problem
released as an open source project on GitHub in 2010
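It’s named after the writer Franz Kafka. Jay Kreps thought a system optimized for writing should carry a writer’s name, and the name sounded cool for an open source project. Otherwise there is not much of a relationship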