Something about Kafka
Frank Yao
@超大杯摩卡星冰乐
2013-06-29
Friday, July 5, 2013

Agenda
• WHAT is Kafka?
• HOW do we use it at Vipshop?
• WHY is Kafka so ‘fast’?

WHAT is Kafka?

WHAT
• “Kafka is a messaging system that was
originally developed at LinkedIn to
serve as the foundation for LinkedIn's
activity stream and operational data
processing pipeline.”

Use cases
• Operational monitoring: real-time, heads-up
monitoring
• Reporting and batch processing: load data into
a data warehouse or Hadoop system

Performance
(The test below is from the Kafka website)
• Parameters:
  • message size = 200 bytes
  • batch size = 200 messages
  • fetch size = 1MB
  • flush interval = 600 messages

Batch Size

Consumer Throughput

Data Size?

Producer threads?

Topic Number?

Traditional Queues
• ActiveMQ, RabbitMQ...

My Test
• Using Flume:
  • In/Out ≈ 300,000 messages per second

Kafka in Vipshop

Data ‘in’ Kafka
• Operational monitoring
  • Nginx access log
  • PHP error log, slow log
• Reporting and batch processing:
  • Nginx access log
  • PHP error log, slow log
  • App log
    • b2c
    • Recommend
    • Pay
    • Passport

How much data?
• Peak time (10:00~10:30):
  • IN: 15k-20k msg per second
  • OUT: 30k-40k msg per second

Apps that depend on Kafka

Kibana(Elasticsearch)

real-time pv uv

Load use Kafka

Replace RabbitMQ
             RabbitMQ      Kafka
Servers      6             1
Load         >10           <2.5
Language     Erlang        Scala
Deployment   Difficult     Easy
Client       A lot         Not many
Management   Web console   JMX

WHY is Kafka ‘fast’?

Basics
• producers
• consumers
• consumer groups
• brokers

Kafka Arch

Kafka Deployment

Major Design Elements
• Persistent messages
• Throughput >>> features
• Consumers hold state
• Everything is distributed

Detail Agenda
• Maximizing Performance
  • Filesystem vs. Memory
  • BTree?
  • Zero-copy
  • End-to-end Batch Compression
• Consumer state
  • Message delivery semantics
  • Consumer state
  • Push vs. Pull
• Message
  • Message format
  • Disk structure
• Zookeeper
  • Directory Structure

Maximize Performance

Filesystem vs. Memory

Which is faster?

Memory
Filesystem

Disk
hardware                linear writes   random writes
6*7200rpm SATA RAID-5   300MB/sec       50k/sec

ACM Pieces

Let’s see something REAL

Server Stats

page cache
• The OS uses free memory for disk caching, which makes
random writes fast

Drawbacks
• All disk reads and writes will go through this
unified cache. This feature cannot easily be
turned off without using direct I/O, so even if
a process maintains an in-process cache of the
data, this data will likely be duplicated in OS
pagecache, effectively storing everything
twice.

If JVM...

If we use memory (JVM)
• The memory overhead of objects is very
high, often doubling the size of the data
stored (or worse).
• Java garbage collection becomes
increasingly sketchy and expensive as
the in-heap data increases.

cache size
• Relying on pagecache at least doubles the available cache by
giving automatic access to all free
memory, and likely doubles it again by
storing a compact byte structure rather
than individual objects. Doing so will
result in a cache of up to 28-30GB on a
32GB machine.

comparison
                 in-disk                        in-memory
GC               no GC                          stop the world
Initialization   stays warm even if restarted   slow rebuild (10 min for 10GB) and cold cache
Logic            handled by OS                  handled by programs

Conclusion
• Using the filesystem and relying on
pagecache is superior to maintaining an
in-memory cache or other structure

Go Extreme!
• Write to the filesystem DIRECTLY!
• (In effect this just means that it is transferred
into the kernel's pagecache where the OS
can flush it later.)

Furthermore
• You can configure a flush every N messages or
every M seconds. This puts a bound on
the amount of data "at risk" in the event
of a hard crash.
• Varnish uses a pagecache-centric design as
well.

BTree

Background
• Messaging-system metadata is often stored in a BTree.
• BTree operations are O(logN).

BTree
• O(logN) ~= constant time

BTree is slow on Disk!

BTree for Disk
• Disk seeks come at 10 ms a pop
• Each disk can do only one seek at a time
• Parallelism is limited
• The observed cost of tree
structures is often super-linear as data grows

Lock
• Page or row locking is needed to avoid locking the
whole tree

Two Facts
• No benefit from higher drive density, because
of the heavy reliance on disk seeks
• Need small (< 100GB) high-RPM SAS
drives to maintain a sane ratio of data
to seek capacity

Use a Log File Structure!

Features
• One queue is one log file
• Operations are O(1)
• Reads do not block writes or each other
• Performance is decoupled from data size
• Messages are retained after consumption
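
The log-file idea above can be sketched as a tiny append-only file where a message is addressed by the byte offset at which it starts. This is an illustration only, not Kafka's code: the class name `AppendOnlyLog` and the 4-byte big-endian length prefix are assumptions made for the example.

```python
import os
import tempfile


class AppendOnlyLog:
    """Hypothetical one-queue-one-file log; messages are addressed by byte offset."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist yet

    def append(self, payload):
        """Append one length-prefixed message and return its starting offset."""
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # appends always go to the end
            f.write(len(payload).to_bytes(4, "big") + payload)
        return offset

    def read(self, offset):
        """Return (payload, next_offset); reading never removes the message."""
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek, independent of how large the log is
            size = int.from_bytes(f.read(4), "big")
            return f.read(size), offset + 4 + size
```

Appending and reading by offset both cost one seek plus a sequential transfer, and a message can be read again later because consumption does not delete it.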

zero-copy

1. The operating system reads data from the disk
into pagecache in kernel space
2. The application reads the data from kernel
space into a user-space buffer
3. The application writes the data back into
kernel space into a socket buffer
4. The operating system copies the data from the
socket buffer to the NIC buffer where it is sent
over the network

zero-copy
• data is copied into pagecache exactly
once and reused on each consumption
instead of being stored in memory and
copied out to kernel space every time it
is read
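
For a feel of what this looks like from application code, here is a minimal Python sketch. It relies on the standard library's `socket.sendfile()`, which uses the `sendfile` system call where the platform supports it and falls back to ordinary `send()` calls elsewhere. The helper name `send_file_zero_copy` is made up for this example; it is not how Kafka itself is written.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected stream socket; return bytes sent.

    Where os.sendfile is available, the data goes from pagecache to the
    socket inside the kernel and never enters a user-space buffer.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

With this path, steps 2 and 3 of the traditional copy sequence (kernel to user space and back) are skipped.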

zero-copy performance

End-to-end Batch Compression
Maximizing Performance

Consider that
• Producers P1, P2 and consumers C1, C2, C3:
  2*compression + 3*de-compression
• In general, with M = num(P) and N = num(C):
  M*compression + N*de-compression

Key point
• End-to-end: compressed by producers and
de-compressed by consumers
• Batch: compression is applied to a whole
‘message set’
• Kafka supports the GZIP and Snappy
codecs
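
A rough sketch of why batching matters for compression, using only the standard library's `gzip`. The function names and the 4-byte length framing are assumptions for this example and do not reproduce Kafka's wire format; Snappy is left out because it is not in the standard library.

```python
import gzip


def compress_batch(messages):
    """Producer side: frame each message with its length, then gzip the set once."""
    framed = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return gzip.compress(framed)


def decompress_batch(blob):
    """Consumer side: gunzip once, then split the set back into messages."""
    data = gzip.decompress(blob)
    messages, pos = [], 0
    while pos < len(data):
        size = int.from_bytes(data[pos:pos + 4], "big")
        messages.append(data[pos + 4:pos + 4 + size])
        pos += 4 + size
    return messages
```

Similar log lines share most of their bytes, so compressing the set as a whole lets the codec exploit redundancy across messages, which per-message compression cannot do.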

Consumer State

Facts
• No ACK
• Consumers maintain the message state

Features
• Each message lives in a partition
• Messages are stored and given out in the order they
arrive
• The ‘watermark’ is called the ‘offset’ in Kafka

track state
• Write message state (the offset) in Zookeeper
• It can be written in one transaction with the data
• Side benefit: a consumer can ‘rewind’ and re-read messages
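
A minimal sketch of consumer-held state, under simplifying assumptions: the partition is a plain Python list, the offset is a list index rather than a byte position, and `OffsetStore` is a hypothetical in-memory stand-in for the Zookeeper node that holds the committed offset.

```python
class OffsetStore:
    """Hypothetical stand-in for the Zookeeper nodes holding committed offsets."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, partition, offset):
        self._offsets[(group, partition)] = offset

    def fetch(self, group, partition):
        # A group with no committed offset starts from the beginning.
        return self._offsets.get((group, partition), 0)


class PullConsumer:
    """Pulls from one partition and tracks its own position; the broker keeps none."""

    def __init__(self, group, partition, log, store):
        self.group, self.partition = group, partition
        self.log, self.store = log, store
        self.offset = store.fetch(group, partition)

    def poll(self, max_messages):
        """Fetch up to max_messages from the current offset and commit the new one."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        self.store.commit(self.group, self.partition, self.offset)
        return batch

    def rewind(self, offset):
        """Move back to an earlier offset so messages can be consumed again."""
        self.offset = offset
        self.store.commit(self.group, self.partition, offset)
```

Because the only state is one integer per group and partition, a restarted consumer resumes from the committed offset, and a different group reads the same log independently.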

Screenshot

push vs. pull
Consumer State

push system
• What if a consumer is <defunct>?

Kafka uses a pull model

Message
Format & Data structure

Msg Format
• N byte message:
  • If magic byte is 0
    1. 1 byte "magic" identifier to allow format changes
    2. 4 byte CRC32 of the payload
    3. N - 5 byte payload
  • If magic byte is 1
    1. 1 byte "magic" identifier to allow format changes
    2. 1 byte "attributes" identifier to allow annotations on the message independent of the version (e.g. compression enabled, type of codec used)
    3. 4 byte CRC32 of the payload
    4. N - 6 byte payload
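
The two layouts above can be sketched with `struct` and `zlib.crc32`. The function names are invented for this example, and big-endian byte order is an assumption; the slide only gives field sizes.

```python
import struct
import zlib


def encode_message(payload, magic=1, attributes=0):
    """Build an N-byte message: magic, [attributes], CRC32, payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    if magic == 0:
        return struct.pack(">BI", 0, crc) + payload  # 5-byte header
    if magic == 1:
        return struct.pack(">BBI", 1, attributes, crc) + payload  # 6-byte header
    raise ValueError("unknown magic byte: %d" % magic)


def decode_message(buf):
    """Return (magic, attributes, payload) after verifying the CRC32."""
    magic = buf[0]
    if magic == 0:
        attributes = 0
        (crc,) = struct.unpack_from(">I", buf, 1)
        payload = buf[5:]
    elif magic == 1:
        attributes, crc = struct.unpack_from(">BI", buf, 1)
        payload = buf[6:]
    else:
        raise ValueError("unknown magic byte: %d" % magic)
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: message is corrupt")
    return magic, attributes, payload
```

The magic byte is read first so that a decoder can pick the right header layout, which is what lets the format evolve.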

Log format on-disk
• On-disk format of a message
  • message length : 4 bytes (value: 1+4+n)
  • ‘magic’ value : 1 byte
  • crc : 4 bytes
  • payload : n bytes
• The message offset, combined with partition id and node id, uniquely identifies a
message
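
A sketch of how such length-prefixed entries can be written and scanned back, with each entry identified by the byte offset where it starts. `frame` and `scan` are hypothetical names, byte order is assumed to be big-endian, and only the magic-0 layout listed above is handled.

```python
import struct
import zlib


def frame(payload):
    """Encode one on-disk entry: length (1+4+n), magic 0, CRC32, payload."""
    body = struct.pack(">BI", 0, zlib.crc32(payload) & 0xFFFFFFFF) + payload
    return struct.pack(">I", len(body)) + body


def scan(log_bytes):
    """Yield (offset, payload) for every entry in a log segment's bytes."""
    pos = 0
    while pos + 4 <= len(log_bytes):
        (length,) = struct.unpack_from(">I", log_bytes, pos)
        magic, crc = struct.unpack_from(">BI", log_bytes, pos + 4)
        payload = log_bytes[pos + 9:pos + 4 + length]
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError("CRC mismatch at offset %d" % pos)
        yield pos, payload
        pos += 4 + length  # the next entry starts right after this one
```

The offset of the next message is simply the current offset plus the entry size, so no separate index of message ids is needed to walk the log.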

Kafka Log Implementation

Writes
Message

Writes
• Append-only writes
• When to flush:
  • M : after M messages are written to a log file
  • S : S seconds after the last flush
• Durability guarantee: losing at most M
messages or S seconds of data in the
event of a system crash
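
The "M messages or S seconds" rule can be sketched as a small policy object. `FlushPolicy` and its method names are made up for this example; a real writer would call something like `os.fsync` on the log file whenever the policy says a flush is due.

```python
import time


class FlushPolicy:
    """Hypothetical policy: flush after M messages or S seconds, whichever is first."""

    def __init__(self, max_messages, max_seconds, clock=time.monotonic):
        self.max_messages = max_messages
        self.max_seconds = max_seconds
        self.clock = clock  # injectable so the policy can be tested without sleeping
        self.unflushed = 0
        self.last_flush = clock()

    def record_append(self):
        """Count one appended message and report whether a flush is now due."""
        self.unflushed += 1
        return self.should_flush()

    def should_flush(self):
        if self.unflushed == 0:
            return False  # nothing at risk, nothing to flush
        too_many = self.unflushed >= self.max_messages
        too_old = self.clock() - self.last_flush >= self.max_seconds
        return too_many or too_old

    def mark_flushed(self):
        """Reset the counters after the data has been forced to disk."""
        self.unflushed = 0
        self.last_flush = self.clock()
```

The two limits bound the data at risk: at most `max_messages` messages, or whatever arrived in the last `max_seconds` seconds.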

Reads
Message

Buffer Reads
• Buffer size doubles automatically until a whole message is read
• You can specify the max buffer size

Offset Search
• Search steps:
  1. locating the log segment file in which
  the data is stored
  2. calculating the file-specific offset from
  the global offset value
  3. reading from that file offset
• Simple binary search in memory
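
Steps 1 and 2 of the search can be sketched with the standard library's `bisect`. The sketch assumes each segment is known by the global offset of its first byte, kept in a sorted in-memory list; `locate` is a hypothetical name, and checking the upper end of the last segment is left out.

```python
from bisect import bisect_right


def locate(segment_base_offsets, global_offset):
    """Return (segment_index, file_offset) for a global offset.

    segment_base_offsets is a sorted list holding the first global offset
    stored in each segment file.
    """
    if not segment_base_offsets or global_offset < segment_base_offsets[0]:
        raise ValueError("offset out of range")
    # Binary search for the last segment whose base offset is <= global_offset.
    index = bisect_right(segment_base_offsets, global_offset) - 1
    return index, global_offset - segment_base_offsets[index]
```

Step 3 is then a single seek to `file_offset` inside the chosen segment file.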

Features
• Reset the offset
• OutOfRangeException (a problem we
ran into)

Deletes
Message

Deletes
• Policy: older than N days or larger than N GB
• Deleting while reading?
  • a copy-on-write style segment list
  implementation that provides
  consistent views to allow a binary
  search to proceed on an immutable
  static snapshot view of the log
  segments
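
The copy-on-write idea can be sketched with an immutable tuple that writers replace rather than modify. `SegmentList` and its methods are illustrative names, not Kafka's implementation.

```python
import threading


class SegmentList:
    """Hypothetical copy-on-write list: readers get immutable snapshots."""

    def __init__(self, segments=()):
        self._segments = tuple(segments)  # never mutated, only replaced
        self._lock = threading.Lock()     # serializes writers; readers take no lock

    def view(self):
        """Return the current snapshot; later changes do not affect it."""
        return self._segments

    def append(self, segment):
        with self._lock:
            self._segments = self._segments + (segment,)

    def delete_oldest(self, count):
        """Drop the first count segments and return them for cleanup."""
        with self._lock:
            removed = self._segments[:count]
            self._segments = self._segments[count:]
        return removed
```

A reader that took a snapshot before a delete can finish its binary search on that snapshot, while new readers already see the shortened list.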

Zookeeper

Directory Structure
Zookeeper

Broker Node
• /brokers/ids/[0...N] --> host:port (ephemeral node)

Broker Topic
• /brokers/topics/[topic]/[0...N] --> nPartitions (ephemeral node)

Consumer Id
• /consumers/[group_id]/ids/[consumer_id] --> {"topic1":
#streams, ..., "topicN": #streams} (ephemeral node)

Consumer Offset Tracking
• /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] -->
offset_counter_value (persistent node)

Partition Owner
• /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] -->
consumer_node_id (ephemeral node)
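
To make the layout concrete, the paths from the slides above can be built with a few string helpers. The function names are invented for this example; only the path shapes come from the slides, and no Zookeeper client is involved.

```python
def broker_path(broker_id):
    """Ephemeral node whose value is the broker's host:port."""
    return f"/brokers/ids/{broker_id}"


def broker_topic_path(topic, broker_id):
    """Ephemeral node whose value is the partition count on that broker."""
    return f"/brokers/topics/{topic}/{broker_id}"


def consumer_id_path(group_id, consumer_id):
    """Ephemeral node listing the topics a consumer reads."""
    return f"/consumers/{group_id}/ids/{consumer_id}"


def offset_path(group_id, topic, broker_id, partition_id):
    """Persistent node holding the group's consumed offset for one partition."""
    return f"/consumers/{group_id}/offsets/{topic}/{broker_id}-{partition_id}"


def owner_path(group_id, topic, broker_id, partition_id):
    """Ephemeral node naming the consumer that currently owns the partition."""
    return f"/consumers/{group_id}/owners/{topic}/{broker_id}-{partition_id}"
```

Only the offset node is persistent, which is why a consumer group's position survives consumer restarts while broker and ownership entries disappear with their sessions.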

Why is Kafka fast?
• Maximizing Performance
  • Filesystem vs. Memory
  • BTree?
  • Zero-copy
  • End-to-end Batch Compression
• Consumer state
  • Message delivery semantics
  • Consumer state
  • Push vs. Pull
• Message
  • Message format
  • Disk structure
• Zookeeper
  • Directory Structure

Thank You!
