Intra-cluster Replication for Apache Kafka
Jun Rao

About myself
• Engineer at LinkedIn since 2010
• Worked on Apache Kafka and Cassandra
• Database researcher at IBM

Outline
• Overview of Kafka
• Kafka architecture
• Kafka replication design
• Performance
• Q/A

What’s Kafka
• A distributed pub/sub messaging system
• Used in many places
– LinkedIn, Twitter, Box, FourSquare …
• What do people use it for?
– log aggregation
– real-time event processing
– monitoring
– queuing

Example Kafka Apps at LinkedIn

Kafka Deployment at LinkedIn
[Diagram: a live data center and an offline data center. Labeled elements: live services, interactive data (human, machine), monitoring, Kafka clusters, Hadoop clusters.]

Per day stats
• writes: 10+ billion messages (2+TB compressed data)
• reads: 50+ billion messages

Kafka vs. Other Messaging Systems
• Scale-out from the ground up
• Persistence to disk
• High throughput (tens of MB/sec per server)
• Multi-subscription

Kafka Architecture
[Diagram: producers at the top, a row of brokers together with Zookeeper in the middle, consumers at the bottom.]

Terminology
• Topic = message stream
• Topic has partitions
– partitions distributed to brokers
• Partition has a log on disk
– message persisted in log
– message addressed by offset
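
A minimal sketch of the "log addressed by offset" idea. It is not Kafka's storage code: the class `PartitionLog` and its methods are hypothetical, the log lives in memory rather than on disk, and it assumes offsets are consecutive integers starting at 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one partition's on-disk log.
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Appends a message at the end of the log and returns its offset.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Reads the message stored at the given offset.
    String read(long offset) {
        return messages.get((int) offset);
    }

    // Offset that the next appended message will receive.
    long logEndOffset() {
        return messages.size();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        long first = log.append("msg1");
        long second = log.append("msg2");
        System.out.println(first + " " + second + " " + log.read(second));
    }
}
```

Because writes only ever append, a consumer needs to remember nothing more than the next offset it wants to read.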

API
• Producer
// producer is a kafka.javaapi.producer.Producer<K,V>
List<KeyedMessage<K,V>> messages = new ArrayList<KeyedMessage<K,V>>();
messages.add(new KeyedMessage<K,V>("topic1", null, "msg1"));
producer.send(messages);

• Consumer
streams[] = Consumer.createMessageStream("topic1", 1);

for(message: streams[0]) {
// do something with message
}

Deliver High Throughput
• Simple storage
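[Diagram: logs in broker. Each topic partition (topic1:part1, topic2:part1) is stored as a sequence of segment files (segment-1, segment-2, … segment-n). A segment holds messages msg-1 … msg-n and has an index. The figure also labels the read() and append() operations.]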

• Batched writes and reads
• Zero-copy transfer from file to socket
• Compression (batched)

Why Replication
• Broker can go down
– controlled: rolling restart for code/config push
– uncontrolled: isolated broker failure
• If broker down
– some partitions unavailable
– could be permanent data loss
• Replication  higher availability and
durability

CAP Theorem
• Pick two from
– consistency
– availability
– partition tolerance (surviving network partitioning)

Kafka Replication: Pick CA
• Brokers within a datacenter
– i.e., network partitioning is rare
• Strong consistency
– replicas byte-wise identical
• Highly available
– typical failover time: < 10ms

Replicas and Layout
• Partition has replicas
• Replicas spread evenly among brokers

[Diagram: four brokers (broker 1 to broker 4), each holding logs with replicas of topic1-part1, topic1-part2, topic2-part1 and topic2-part2.]

Maintain Strongly Consistent Replicas
• One of the replicas is leader
• All writes go to leader
• Leader propagates writes to followers in order
• Leader decides when to commit message

Conventional Quorum-based Commit
• Wait for majority of replicas (e.g. Zookeeper)
• Plus: good latency
• Minus: 2f+1 replicas → tolerate f failures
– ideally want to tolerate 2f failures

Commit Messages in Kafka
• Leader maintains in-sync-replicas (ISR)
– initially, all replicas in ISR
– message committed if received by ISR
– follower fails → dropped from ISR
– leader commits using new ISR
• Benefit: f replicas → tolerate f-1 failures
– latency is less of an issue within a datacenter
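
The ISR rule above can be sketched as follows. This is an illustration under stated assumptions, not Kafka's implementation: the class `IsrLeader`, its method names, and the use of -1 for "nothing received yet" are all hypothetical. The last committed offset is taken to be the highest offset that every current ISR member has received.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a partition leader tracking its in-sync replicas (ISR).
class IsrLeader {
    private final Set<String> isr = new HashSet<>();
    // Highest offset each replica has received so far (-1 means nothing yet).
    private final Map<String, Long> received = new HashMap<>();

    IsrLeader(String... replicas) {
        for (String r : replicas) {
            isr.add(r);            // initially, all replicas are in the ISR
            received.put(r, -1L);
        }
    }

    // Records that a replica has received every message up to this offset.
    void onReceived(String replica, long offset) {
        received.put(replica, offset);
    }

    // A failed follower is dropped; later commits use the smaller ISR.
    void dropFromIsr(String replica) {
        isr.remove(replica);
    }

    // Highest offset received by every ISR member, or -1 if the ISR is empty.
    long lastCommitted() {
        if (isr.isEmpty()) {
            return -1L;
        }
        long min = Long.MAX_VALUE;
        for (String r : isr) {
            min = Math.min(min, received.get(r));
        }
        return min;
    }

    public static void main(String[] args) {
        IsrLeader leader = new IsrLeader("A", "B", "C");
        leader.onReceived("A", 2);
        leader.onReceived("B", 1);
        leader.onReceived("C", 0);
        System.out.println(leader.lastCommitted()); // limited by the slowest ISR member
        leader.dropFromIsr("C");
        System.out.println(leader.lastCommitted()); // commit advances with the new ISR
    }
}
```

With a slow or failed follower removed from the ISR, the leader can keep committing as long as at least one replica (itself) remains, which is where the "f replicas tolerate f-1 failures" figure comes from.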

Data Flow in Replication
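[Diagram: the producer sends a message to the leader of topic1-part1 on broker 1; the message is replicated to the followers on broker 2 and broker 3; the leader commits the message and acks the producer; the consumer reads topic1-part1 from the leader. The original figure numbers these steps 1 to 4.]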

When producer receives ack   Latency                Durability on failures
no ack                       no network delay       some data loss
wait for leader              1 network roundtrip    a little data loss
wait for committed           2 network roundtrips   no data loss

Only committed messages exposed to consumers
• independent of ack type chosen by producer
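
As a sketch of how a producer might choose among the three ack types, the snippet below builds plain `java.util.Properties`. It assumes the 0.8 producer setting `request.required.acks` with the values 0 (no ack), 1 (wait for leader) and -1 (wait for committed). The class `ProducerAckConfig`, the mode names and the broker host names are placeholders, not part of Kafka.

```java
import java.util.Properties;

// Hypothetical helper that maps the ack types in the table to producer properties.
class ProducerAckConfig {
    static Properties forMode(String mode) {
        Properties props = new Properties();
        // Placeholder broker addresses; replace with real hosts.
        props.setProperty("metadata.broker.list", "broker1:9092,broker2:9092");
        props.setProperty("serializer.class", "kafka.serializer.StringEncoder");
        String acks;
        switch (mode) {
            case "no ack":    acks = "0";  break; // do not wait for any response
            case "leader":    acks = "1";  break; // wait until the leader has the message
            case "committed": acks = "-1"; break; // wait until the message is committed
            default: throw new IllegalArgumentException("unknown mode: " + mode);
        }
        props.setProperty("request.required.acks", acks);
        return props;
    }

    public static void main(String[] args) {
        System.out.println(forMode("committed").getProperty("request.required.acks"));
    }
}
```

The resulting `Properties` object would then be passed to the producer configuration; that step is left out so the sketch runs without the Kafka jars.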

Extend to Multiple Partitions
[Diagram: four brokers (broker 1 to broker 4) hosting three partitions, each with one leader and two followers; each producer sends to the leader of its partition, and the leaders sit on different brokers.]

• Leaders are evenly spread among brokers

Handling Follower Failures
• Leader maintains last committed offset
– propagated to followers
– checkpointed to disk
• When follower restarts
– truncate log to last committed
– fetch data from leader
– fully caught up → added to ISR
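
The restart steps above can be sketched as a small function. This is illustrative only: `FollowerRecovery.recover` is a hypothetical name, logs are modeled as in-memory lists, `lastCommitted` is a 0-based index, and the sketch assumes the leader already holds every committed message. The values mirror the replica recovery example in this deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of what a follower does with its log after a restart.
class FollowerRecovery {
    // lastCommitted: checkpointed 0-based index of the last committed message.
    static List<String> recover(List<String> followerLog, int lastCommitted,
                                List<String> leaderLog) {
        // 1. Truncate: keep only messages up to and including the last committed one.
        int keep = Math.min(lastCommitted + 1, followerLog.size());
        List<String> log = new ArrayList<>(followerLog.subList(0, keep));
        // 2. Fetch everything the leader has beyond the truncation point.
        log.addAll(leaderLog.subList(log.size(), leaderLog.size()));
        return log;
    }

    public static void main(String[] args) {
        List<String> followerA = Arrays.asList("m1", "m2", "m3");
        List<String> leaderB = Arrays.asList("m1", "m2", "m4", "m5");
        // A checkpointed m1 (index 0) as last committed before it failed.
        System.out.println(recover(followerA, 0, leaderB));
    }
}
```

Truncating first discards messages such as m3 that were never committed, so the follower cannot diverge from the new leader's log.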

Handling Leader Failure
• Use an embedded controller (inspired by Helix)
– detect broker failure via Zookeeper
– on leader failure: elect new leader from ISR
– committed messages not lost
• Leader and ISR written to Zookeeper
– for controller failover
– expected to change infrequently
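
A sketch of the election rule "elect new leader from ISR". The class `LeaderElection` is hypothetical, and picking the first live ISR member is an assumption made for illustration; the slides only state that the new leader comes from the ISR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a controller choosing a new leader for one partition.
class LeaderElection {
    // Returns the first replica that is both in the ISR and alive, or null if none.
    static String electLeader(List<String> isr, Set<String> liveBrokers) {
        for (String replica : isr) {
            if (liveBrokers.contains(replica)) {
                return replica;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> isr = Arrays.asList("A", "B", "C");
        Set<String> live = new HashSet<>(Arrays.asList("B", "C")); // A has failed
        System.out.println(electLeader(isr, live));
    }
}
```

Since every ISR member has received all committed messages, any of them can take over without losing committed data.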

Example of Replica Recovery
1. ISR = {A,B,C}; Leader A commits message m1
   L (A)        F (B)   F (C)
   m1           m1      m1      <- last committed
   m2           m2
   m3

2. A fails and B is new leader; ISR = {B,C}; B commits m2, but not m3
   A (failed)   L (B)   F (C)
   m1           m1      m1
   m2           m2      m2
   m3

3. B commits new messages m4, m5
   A (failed)   L (B)   F (C)
   m1           m1      m1
   m2           m2      m2
   m3           m4      m4
                m5      m5

4. A comes back, truncates to m1 and catches up; finally ISR = {A,B,C}
   After truncation:          After catch-up:
   F (A)   L (B)   F (C)      F (A)   L (B)   F (C)
   m1      m1      m1         m1      m1      m1
           m2      m2         m2      m2      m2
           m4      m4         m4      m4      m4
           m5      m5         m5      m5      m5

Setup
• 3 brokers
• 1 topic with 1 partition
• Replication factor=3
• Message size = 1KB

Choosing between Latency and Durability

When producer receives ack   Time to publish a message (ms)   Durability on failures
no ack                       0.29                             some data loss
wait for leader              1.05                             a little data loss
wait for committed           2.05                             no data loss

Producer Throughput

[Two charts of producer throughput in MB/s (y-axis 0 to 70), each with three series: no ack, leader, committed. Left: varying messages per send (1, 10, 100, 1000). Right: varying number of concurrent producers (1, 5, 10, 20).]

Consumer Throughput

[Chart: throughput vs. fetch size. y-axis: MB/s (0 to 100); x-axis: fetch size (1KB, 10KB, 100KB, 1MB).]

Q/A
• Kafka 0.8.0 (intra-cluster replication)
– expected to be released in Mar
– various performance improvements in the future
• Check out more about Kafka
– http://kafka.apache.org/
• Kafka meetup tonight

Kafka replication apachecon_2013
