Bad/Bed time Stories
Today's Menu

Quick Kafka Overview

Kafka Usage At AppsFlyer

AppsFlyer First Cluster

Designing The Next Cluster: Requirements And Changes

Problems With The New Cluster

Changes To The New Cluster

Traffic Boost, Then More Issues

More Solutions

And More Failures

Splitting The Cluster, More Changes And The Current Configuration

Lessons Learned

Testing The Cluster

Collecting Metrics And Alerting
"A first sign of the beginning of
understanding is the wish to die."
Franz Kafka
Quick Kafka Overview
An open source, distributed, partitioned, and replicated commit-log-based publish-subscribe messaging system
Kafka Overview

Topic: A category to which messages are published by message producers

Broker: Kafka server process (usually one per node)

Partitions: Topics are split into partitions; each partition is an ordered, immutable sequence of messages. Each message in a partition is assigned a unique ID called its offset (see the sketch below)
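To make the glossary concrete, here is a minimal producer/consumer sketch. The kafka-python client, broker address, topic name, and payload are illustrative assumptions, not details from the deck.

```python
# Sketch: produce to and consume from a topic to show partitions and offsets.
# The kafka-python client, broker address, and topic name are assumptions.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# The broker appends the message to one partition of the topic and returns
# the partition number and the message's offset within that partition.
metadata = producer.send("installs", b'{"app_id": "demo"}').get(timeout=10)
print(f"wrote to partition={metadata.partition} at offset={metadata.offset}")
producer.flush()

# A consumer reads a partition's ordered, immutable sequence of messages,
# tracking its position with the offset.
consumer = KafkaConsumer("installs", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest", consumer_timeout_ms=5000)
for record in consumer:
    print(record.partition, record.offset, record.value)
```

Because an offset is a message's position within its partition, a consumer can stop and later resume from its last committed offset.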
Kafka Usage in AppsFlyer
AppsFlyer First Cluster

Traffic: up to a few hundred million messages

Size: 4 m1.xlarge brokers

~8 Topics

Replication factor 1

Retention 8-12H

Default number of partitions 8

Vanilla configuration
Main reasons for migration: lack of storage capacity, limited parallelism due to the low partition count, and forecasted future growth.
Requirements for the Next Cluster

More capacity to support billions of messages

Message replication to prevent data loss

Tolerate the loss of brokers, up to an entire AZ

Much higher parallelism to support more consumers

Longer retention period: 48 hours on most topics
The New Cluster Changes

18 m1.xlarge brokers, 6 per AZ

Replication factor of 3

All partitions are distributed across AZs

Per-topic partition counts increased (between 12 and 120, depending on parallelism needs)

4 Network and IO threads

Default log retention 48 hours

Auto Leader rebalance enabled

Imbalance ratio set to the default of 15% (see the configuration sketch after the glossary)
* Leader: each partition has a leader that serves all reads and writes; the other replicas replicate from it
* Imbalance ratio: the maximum percentage of leadership a broker may hold; above that, automatic rebalancing is initiated
Glossary
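For reference, the settings above map onto standard Kafka broker properties. The deck does not show the actual server.properties, so the following sketch only mirrors the values listed on this slide; the output file name is a placeholder.

```python
# Sketch: write the slide's settings out as standard Kafka broker properties.
# Property names are standard Kafka configs; values are taken from the slide.
# The slide lists "4 Network and IO threads", interpreted here as 4 of each.
broker_overrides = {
    "default.replication.factor": "3",
    "num.network.threads": "4",
    "num.io.threads": "4",
    "log.retention.hours": "48",
    "auto.leader.rebalance.enable": "true",
    "leader.imbalance.per.broker.percentage": "15",
}
# Per-topic partition counts (12-120) are set at topic creation time rather
# than in the broker config, so they are not included here.

with open("server.properties.sample", "w") as f:
    f.writelines(f"{key}={value}\n" for key, value in broker_overrides.items())
```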
And After a Few Months
Problems

Uneven distribution of leaders, which caused high load on specific brokers and eventually consumer lag and broker failures

Constant rebalancing of partition leaders, which caused failures in the Python producers
Solutions

Increase the number of brokers to 24 to improve leadership distribution

Rewrite Python
producers in Clojure

Decrease the number of partitions where high parallelism is not needed
Traffic kept increasing, and...
Problems

High I/O wait on the brokers

Missing ISRs due to overloaded leaders

Network bandwidth close to thresholds

Lag in consumers
* ISR: In-Sync Replicas
Glossary
More Solutions

Split into 2 clusters: launches, which carries 80% of the messages, and a second cluster for all the rest

Move the launches cluster to i2.2xlarge instances with local SSDs

Finer tuning of leaders

Increase number of IO and Network Threads

Enable AWS enhanced networking
And a few more...

Decrease the replication factor to 2 in the launches cluster to reduce load on leaders, disk usage, and cross-AZ traffic costs

Move 2nd cluster to
i2.2xlarge as well

Upgrade ZooKeeper due to performance issues
Lessons learned

Keep the replication factor as low as possible to avoid extra load on the leaders

Make sure leader counts are well balanced across brokers

Balance the number of partitions to support the required parallelism

Split the cluster logically, considering traffic and business importance

Time-based retention should be long enough to recover from failures

In AWS, spread the cluster across AZs

Make sure clients support dynamic cluster changes

Create automation for partition reassignment (see the sketch after this list)

Save the cluster-reassignment.json of each topic for future needs!

Don't be too cheap on the ZooKeepers
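As a concrete example of the reassignment-automation point above, here is a minimal sketch that builds and saves a cluster-reassignment.json for one topic. The topic name, partition count, broker IDs, and replication factor are hypothetical placeholders, not values from the deck.

```python
# Sketch: generate a cluster-reassignment.json for one topic and save it for
# future reuse. All names and numbers below are hypothetical placeholders.
import itertools
import json

topic = "launches"              # hypothetical topic name
partition_count = 12            # hypothetical partition count
brokers = [1, 2, 3, 4, 5, 6]    # hypothetical broker IDs
replication_factor = 2

# Simple round-robin replica placement; a real tool should also spread
# replicas across AZs and balance expected leader load.
cycle = itertools.cycle(brokers)
assignment = {
    "version": 1,
    "partitions": [
        {
            "topic": topic,
            "partition": p,
            "replicas": [next(cycle) for _ in range(replication_factor)],
        }
        for p in range(partition_count)
    ],
}

with open(f"cluster-reassignment-{topic}.json", "w") as f:
    json.dump(assignment, f, indent=2)
```

The saved file can later be applied with Kafka's kafka-reassign-partitions.sh tool and kept alongside the topic for future reassignments.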
Testing the cluster

Load test using kafka-producer-perf-test.sh &
kafka-consumer-perf-test.sh

Broker failure while running

Entire AZ failure while running

Reassign partitions on the fly

Kafka dashboard includes: leader election rate, ISR status, offline partition count, log flush time, bytes in per broker (all topics), I/O wait, load average, disk capacity, and more

Set appropriate alerts
Collecting metrics & Alerting

Using the Airbnb plugin for Kafka to send metrics to Graphite

An internal application collects the lag for each topic and sends the values to Graphite (see the sketch below)

Alerts are set on: lag (per topic), under-replicated partitions, broker topic metrics below threshold, and leader re-election
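The internal lag collector itself is not shown in the deck; the following is a minimal sketch of the idea using the kafka-python client and broker-stored consumer offsets, pushing one value per topic to Graphite's plaintext port. The broker and Graphite addresses, topic names, and consumer group names are placeholders.

```python
# Sketch of a per-topic lag collector that pushes values to Graphite.
# Uses kafka-python and broker-stored consumer offsets (an assumption);
# hosts, topics, and group names below are placeholders.
import socket
import time

from kafka import KafkaConsumer, TopicPartition

GRAPHITE_ADDR = ("graphite.internal", 2003)               # placeholder host, plaintext port
BROKERS = "kafka-broker:9092"                             # placeholder
TOPICS_AND_GROUPS = [("launches", "launches-consumers")]  # placeholders


def topic_lag(consumer, topic):
    """Sum of (latest offset - committed offset) over the topic's partitions."""
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    latest = consumer.end_offsets(partitions)
    return sum(max(latest[tp] - (consumer.committed(tp) or 0), 0) for tp in partitions)


for topic, group in TOPICS_AND_GROUPS:
    consumer = KafkaConsumer(bootstrap_servers=BROKERS, group_id=group)
    line = f"kafka.lag.{topic} {topic_lag(consumer, topic)} {int(time.time())}\n"
    with socket.create_connection(GRAPHITE_ADDR) as sock:
        sock.sendall(line.encode())
    consumer.close()
```

Running something like this on a schedule produces the per-topic lag series that the alerts above are set on.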