Self-hosting Kafka at Scale
Netflix's Journey & Challenges
Piyush Goyal, Staff Engineer, Data Platform
Nick Mahilani, Staff Engineer, Data Platform
Current 2024
Thank you for being here!
RAISE YOUR HAND
IF YOU USE KAFKA IN YOUR ORGANIZATION
KEEP YOUR HAND UP
IF YOU ARE SELF-HOSTING APACHE KAFKA
(NOT using a Kafka service provider)
WHAT CAN YOU EXPECT FROM THIS SESSION?
 How Netflix leverages Kafka to unlock various use-cases
 Our long journey with Kafka
 How we operate Kafka today
 Challenges and learnings
 Business Context
 Keystone Platform (2015)
 Evolution to Composable Architecture
 Kafka as a Service (2021)
 KaaS Features and Architecture
 KaaS Learnings
Our Journey With Kafka
Netflix Scale (as of August 2024)
 Devices: >1,000,000,000
 Countries: >190
 Members: >278,000,000
Microservices Ecosystem
 Systems at our scale generate a lot of data
 This data needs to be transported to where it can be processed and analysed
Centralized Event Pipeline (2015)
The System should have the following characteristics:
 Easy to use
 Highly Available
 Scalable
 Near Real-Time
This gave rise to Netflix's Keystone Platform in 2015
Keystone Platform (2015)
 Highly abstracted product
 Data Movement to Sinks
 Simple Real-time processing (Filter, Projection)
 Client Library, UI, Management plane, and Data Plane
 Used Apache Kafka and Apache Flink under the hood
Keystone - User Interface
Keystone Platform
[Architecture diagram, built up over several slides. Components: Event Producers (publish events with the Kafka-agnostic Keystone client library), Keystone Management, Fronting Kafka, Router/Processor, Consumer Kafka, Stream Consumers]
Two types of Kafka Clusters
FRONTING
 Multi-tenant clusters
 Used to publish data
 Abstracted from producers
 Controlled Cluster access
 Critical for High availability
 Larger Fleet
CONSUMER
 Multi-tenant clusters
 Used to consume data
 Coupled with consumers
 Smaller Fleet
Resilience to cluster failure
Keystone Client: Topic lookup
Stream            Cluster     Topic
playback_events   Cluster A   playback_events
ad_events         Cluster B   ad_events
Fronting clusters:
 Cluster A: Topic: playback_events
 Cluster B: Topic: ad_events
Resilience to cluster failure (after Cluster A fails)
Keystone Client: Topic lookup
Stream            Cluster                  Topic
playback_events   Cluster A -> Cluster B   playback_events
ad_events         Cluster B                ad_events
Fronting clusters:
 Cluster A: Topic: playback_events
 Cluster B: Topic: ad_events, Topic: playback_events
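The topic lookup on the two slides above can be sketched as a small routing table that a Kafka-agnostic client consults before publishing. This is a minimal illustration, not Keystone's actual implementation: the `TopicLookup` class and its `resolve` and `fail_over` methods are hypothetical names, and only the stream, cluster, and topic values come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLookup:
    """Hypothetical stream -> (cluster, topic) routing table."""

    routes: dict[str, tuple[str, str]] = field(default_factory=dict)

    def resolve(self, stream: str) -> tuple[str, str]:
        """Return the (cluster, topic) a producer should publish to."""
        return self.routes[stream]

    def fail_over(self, failed_cluster: str, healthy_cluster: str) -> list[str]:
        """Repoint every stream on failed_cluster; return the moved streams."""
        moved = []
        for stream, (cluster, topic) in list(self.routes.items()):
            if cluster == failed_cluster:
                self.routes[stream] = (healthy_cluster, topic)
                moved.append(stream)
        return moved


lookup = TopicLookup({
    "playback_events": ("Cluster A", "playback_events"),
    "ad_events": ("Cluster B", "ad_events"),
})
moved_streams = lookup.fail_over("Cluster A", "Cluster B")
```

Because producers only know the stream name, repointing the table is enough to move traffic; producer code does not change. The sketch assumes the topic already exists on the healthy cluster.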
 Things worked well...
 Highly abstracted and easy to use product
 Only takes a couple of minutes to create simple data pipelines
 Huge adoption: more than 6000 data pipelines
 >100M messages per second (>150 GB/s)
 Quick real-time transformations like filtering and projection
Not everything worked well
 For streaming-only consumers, it was highly inefficient
   Unnecessary hops
   Higher latency
   Extra cost
 Noisy neighbors in a multi-tenant environment
 No direct access to Kafka for producers
 Administration of Kafka was semi-automated
And we needed more...
 Highly abstracted product means limited functionality done well
 Solved 80% of use-cases; what about the rest?
 New business requirements demanded more functionality
   Event Driven Architecture
   Change Data Capture
   Low latency use-cases
   Custom Stream Processing
   Direct Kafka integration for third-party tools
Architecture Evolution
 Closed System: Pipeline Abstraction
 Composable System: Pipeline Abstraction, Kafka as a Service, Stream Processing
Whether to build or buy?
 We evaluated the tradeoffs for our situation (Year 2020-21)
 Customizability
 Long term costs
 Available in-house expertise
 Minimize Risks
After careful consideration, we decided to BUILD our own managed Kafka Platform. YMMV!
Kafka as a Service (KaaS)
 Alerting & Auto Remediation
 Security & Access Control
 Observability
 Client Library
 Schema Management
 Provisioning
Provisioning Kafka Clusters: SHARED vs. DEDICATED
Kafka Cluster Configuration
 High-availability
   Replication factor = 2
   Min in-sync replicas = 1
   Unclean leader election enabled
 Strong Consistency
   Replication factor = 3
   Min in-sync replicas = 2
   Unclean leader election disabled
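As a rough sketch, the two profiles above map onto standard Apache Kafka settings: `min.insync.replicas` and `unclean.leader.election.enable` are topic-level configs, and the replication factor is chosen when a topic is created. The `ClusterProfile` class and its method names are hypothetical, for illustration only, and say nothing about how KaaS stores these profiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical container for the settings listed on the slide."""

    name: str
    replication_factor: int
    min_insync_replicas: int
    unclean_leader_election: bool

    def topic_configs(self) -> dict[str, str]:
        """Topic-level Kafka configs (replication factor is set at creation)."""
        return {
            "min.insync.replicas": str(self.min_insync_replicas),
            "unclean.leader.election.enable": str(self.unclean_leader_election).lower(),
        }

    def replicas_lost_before_acks_all_writes_fail(self) -> int:
        """How many replicas can drop out of the ISR before acks=all produces are rejected."""
        return self.replication_factor - self.min_insync_replicas


HIGH_AVAILABILITY = ClusterProfile("high-availability", 2, 1, True)
STRONG_CONSISTENCY = ClusterProfile("strong-consistency", 3, 2, False)
```

Both profiles keep accepting `acks=all` writes with one replica down. The difference is what happens next: with unclean leader election enabled, an out-of-sync replica may become leader, which favors availability at the risk of losing acknowledged data, while the strong-consistency profile keeps the partition offline until an in-sync replica returns.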
Access Control
Audit Log
Admin Operations
KaaS Architecture
KaaS Scale
190 million messages / second
150+ GB ingested / second
8+ PB persisted state
475+ dedicated Kafka Clusters
11,500 Kafka brokers
35,000 Kafka topics
1. Scaling a single Kafka Cluster
Scaling Up a Cluster before KaaS
Topic partition counts were tightly coupled with number of brokers
Scaling Up a Cluster in KaaS
Using OSS Cruise Control:
Topic partition counts independent of number of brokers
2. Making Cluster Upgrades Faster
Upgrade time v/s State Size
Kafka Broker Instance
Unit of Change: AWS EC2 instance
Kafka Fleet Upgrades
                   Upgrade Time (old strategy)   Desired Upgrade Time   Upgrade Frequency
Hardware Upgrade   3+ months                     < 1 month              annually
Software Upgrade   3+ months                     < 1 week               monthly
Software Upgrade Strategy #1
Leverage Amazon Elastic Block Store (EBS)
Source: https://aws.amazon.com/ebs/
Move Kafka state from local instance storage to EBS
Software Upgrade Strategy #1: EBS is awesome but ...
 EBS is expensive at large scale
 Moved large scale clusters back to AWS instance types with local disk
 Back to where we started: longer upgrade times
How can we upgrade faster without EBS?
Software Upgrade Strategy #2
AWS Replace Root Volume to upgrade AMI
Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/replace-root.html
Kafka Fleet Upgrades with Replace Root Volume Strategy
                   Current Upgrade Time   Desired Upgrade Time   Upgrade Frequency
Hardware Upgrade   1+ month               < 1 month              annually
Software Upgrade   5 days                 < 1 week               monthly
3. Cost Efficiency
Right Sizing a Kafka Cluster
Inputs:
 Num Consumers
 Throughput
 Replication Factor
 Retention
Questions to answer:
 Which EC2 instance type?
 How many instances?
 How much disk?
The Kafka Capacity Model takes these inputs and produces candidate cluster shapes:
Num Brokers   Instance Type   Cost
3             i3en.2xl        $
3             i4i.2xl         $$
6             r5.4xl + EBS    $$$
https://github.com/Netflix-Skunkworks/service-capacity-modeling/blob/main/service_capacity_modeling/models/org/netflix/kafka.py
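To make the sizing inputs concrete, here is a back-of-the-envelope sketch of how throughput, retention, replication factor, and consumer count can translate into a broker count. It is not the Netflix capacity model linked above: the `size_cluster` function, the 50% headroom default, and the per-broker disk and network figures in the example are assumptions chosen for illustration.

```python
import math


def size_cluster(write_mb_per_s, retention_hours, replication_factor,
                 num_consumers, disk_gb_per_broker, net_mb_per_s_per_broker,
                 headroom=0.5):
    """Estimate a broker count from disk and network egress needs (illustrative only)."""
    # Disk: every byte written is stored replication_factor times for the retention window.
    stored_gb = write_mb_per_s * retention_hours * 3600 * replication_factor / 1000
    # Egress: followers fetch (replication_factor - 1) copies, and each consumer reads one copy.
    egress_mb_per_s = write_mb_per_s * (replication_factor - 1 + num_consumers)
    # Only plan to use a fraction (headroom) of each broker's raw capacity.
    brokers_for_disk = math.ceil(stored_gb / (disk_gb_per_broker * headroom))
    brokers_for_net = math.ceil(egress_mb_per_s / (net_mb_per_s_per_broker * headroom))
    return {
        "stored_gb": stored_gb,
        "egress_mb_per_s": egress_mb_per_s,
        # Never go below the replication factor, so each replica has its own broker.
        "brokers": max(brokers_for_disk, brokers_for_net, replication_factor),
    }


# Hypothetical workload: 100 MB/s in, 8 h retention, RF 3, 2 consumers,
# on brokers with 5000 GB of disk and 1000 MB/s of network each.
estimate = size_cluster(100, 8, 3, 2, disk_gb_per_broker=5000, net_mb_per_s_per_broker=1000)
```

In this example disk is the binding constraint (8640 GB stored needs 4 brokers at 50% utilization), which is why retention and replication factor matter as much as throughput when picking an instance type.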
Key Takeaway
Composable architectures are easier to scale and evolve with the business
[Diagram: Closed System (Pipeline Abstraction) vs. Composable System (Pipeline Abstraction, Kafka as a Service, Stream Processing)]
Q & A
Self-hosting Kafka at Scale
Netflix's Journey & Challenges
Piyush Goyal Nick Mahilani
References
 S3 Flash Bootloader (precursor to AWS Replace Root Volume)
 Joey's talk on "Capacity Plan optimally in the cloud"
 Kyle and JS's talk on "Iterating faster on Stateful Services in the cloud"

More Related Content

Similar to Self-hosting Kafka at Scale: Netflix's Journey & Challenges (20)

Elastically Scaling Kafka Using Confluent
Elastically Scaling Kafka Using ConfluentElastically Scaling Kafka Using Confluent
Elastically Scaling Kafka Using Confluent
confluent
Twitters Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitters Apache Kafka Adoption Journey | Ming Liu, TwitterTwitters Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitters Apache Kafka Adoption Journey | Ming Liu, Twitter
HostedbyConfluent
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
confluent
Build and Deploy Cloud Native Camel Quarkus routes with Tekton and Knative
Build and Deploy Cloud Native Camel Quarkus routes with Tekton and KnativeBuild and Deploy Cloud Native Camel Quarkus routes with Tekton and Knative
Build and Deploy Cloud Native Camel Quarkus routes with Tekton and Knative
Omar Al-Safi
New Features in Confluent Platform 6.0 / Apache Kafka 2.6
New Features in Confluent Platform 6.0 / Apache Kafka 2.6New Features in Confluent Platform 6.0 / Apache Kafka 2.6
New Features in Confluent Platform 6.0 / Apache Kafka 2.6
Kai W辰hner
Strategies For Migrating From SQL to NoSQL The Apache Kafka Way
Strategies For Migrating From SQL to NoSQL  The Apache Kafka WayStrategies For Migrating From SQL to NoSQL  The Apache Kafka Way
Strategies For Migrating From SQL to NoSQL The Apache Kafka Way
ScyllaDB
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
Monal Daxini
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...
HostedbyConfluent
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
aspyker
From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...
From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...
From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...
HostedbyConfluent
Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...
Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...
Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...
Kai W辰hner
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWSBridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
confluent
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdfDIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
confluent
Reinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun RaoReinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun Rao
confluent
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
Datadog
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Monal Daxini
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
All Things Open
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
NETWAYS
Pivotal CloudFoundry on Google cloud platform
Pivotal CloudFoundry on Google cloud platformPivotal CloudFoundry on Google cloud platform
Pivotal CloudFoundry on Google cloud platform
Ronak Banka
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
Monal Daxini
Elastically Scaling Kafka Using Confluent
Elastically Scaling Kafka Using ConfluentElastically Scaling Kafka Using Confluent
Elastically Scaling Kafka Using Confluent
confluent
Twitters Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitters Apache Kafka Adoption Journey | Ming Liu, TwitterTwitters Apache Kafka Adoption Journey | Ming Liu, Twitter
Twitters Apache Kafka Adoption Journey | Ming Liu, Twitter
HostedbyConfluent
Westpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache KafkaWestpac Bank Tech Talk 1: Dive into Apache Kafka
Westpac Bank Tech Talk 1: Dive into Apache Kafka
confluent
Build and Deploy Cloud Native Camel Quarkus routes with Tekton and Knative
Build and Deploy Cloud Native Camel Quarkus routes with Tekton and KnativeBuild and Deploy Cloud Native Camel Quarkus routes with Tekton and Knative
Build and Deploy Cloud Native Camel Quarkus routes with Tekton and Knative
Omar Al-Safi
New Features in Confluent Platform 6.0 / Apache Kafka 2.6
New Features in Confluent Platform 6.0 / Apache Kafka 2.6New Features in Confluent Platform 6.0 / Apache Kafka 2.6
New Features in Confluent Platform 6.0 / Apache Kafka 2.6
Kai W辰hner
Strategies For Migrating From SQL to NoSQL The Apache Kafka Way
Strategies For Migrating From SQL to NoSQL  The Apache Kafka WayStrategies For Migrating From SQL to NoSQL  The Apache Kafka Way
Strategies For Migrating From SQL to NoSQL The Apache Kafka Way
ScyllaDB
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
Monal Daxini
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Curr...
HostedbyConfluent
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
aspyker
From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...
From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...
From Monoliths to Microservices - A Journey With Confluent With Gayathri Veal...
HostedbyConfluent
Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...
Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...
Confluent Platform 5.5 + Apache Kafka 2.5 => New Features (JSON Schema, Proto...
Kai W辰hner
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWSBridge to Cloud: Using Apache Kafka to Migrate to AWS
Bridge to Cloud: Using Apache Kafka to Migrate to AWS
confluent
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdfDIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
DIMT '23 Session_Demo_ Latest Innovations Breakout.pdf
confluent
Reinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun RaoReinventing Kafka in the Data Streaming Era - Jun Rao
Reinventing Kafka in the Data Streaming Era - Jun Rao
confluent
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
Datadog
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Monal Daxini
Netflix Architecture and Open Source
Netflix Architecture and Open SourceNetflix Architecture and Open Source
Netflix Architecture and Open Source
All Things Open
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
stackconf 2020 | The path to a Serverless-native era with Kubernetes by Paolo...
NETWAYS
Pivotal CloudFoundry on Google cloud platform
Pivotal CloudFoundry on Google cloud platformPivotal CloudFoundry on Google cloud platform
Pivotal CloudFoundry on Google cloud platform
Ronak Banka
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
Monal Daxini

Recently uploaded (20)

Profisee - HIMSS workshop - Mar 2025 - final.pptx
Profisee - HIMSS workshop - Mar 2025 - final.pptxProfisee - HIMSS workshop - Mar 2025 - final.pptx
Profisee - HIMSS workshop - Mar 2025 - final.pptx
Profisee
How managers can learn how to lead with humour
How managers can learn how to lead with humourHow managers can learn how to lead with humour
How managers can learn how to lead with humour
analystcoupons8q
AAC2025_Baptiste Grand_Der OKR.X Guide.pdf
AAC2025_Baptiste Grand_Der OKR.X Guide.pdfAAC2025_Baptiste Grand_Der OKR.X Guide.pdf
AAC2025_Baptiste Grand_Der OKR.X Guide.pdf
Agile Austria Conference
Globibo Book Translation: Connect with Readers in Any Language
Globibo Book Translation: Connect with Readers in Any LanguageGlobibo Book Translation: Connect with Readers in Any Language
Globibo Book Translation: Connect with Readers in Any Language
globibo
AI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbv
AI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbvAI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbv
AI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbv
xdlmao561
Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...
Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...
Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...
Museums Galleries Scotland
Johari window introduction to identifying the personality attributes
Johari window introduction to identifying the personality attributesJohari window introduction to identifying the personality attributes
Johari window introduction to identifying the personality attributes
yogitabharatmandhany
2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx
2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx
2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx
Dale Wells
Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...
Museums Galleries Scotland
AAC2025_Danninger_Fail fast succeed smarter.pdf
AAC2025_Danninger_Fail fast succeed smarter.pdfAAC2025_Danninger_Fail fast succeed smarter.pdf
AAC2025_Danninger_Fail fast succeed smarter.pdf
Agile Austria Conference
Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...
Museums Galleries Scotland
the el filibusterismo reporting lesson 3.4
the el filibusterismo reporting lesson 3.4the el filibusterismo reporting lesson 3.4
the el filibusterismo reporting lesson 3.4
riverajaysongit446
AMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdf
AMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdfAMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdf
AMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdf
sakivvikas86
FIFA Friendly Match at Alberni Valley - Strategic Plan.pptx
FIFA Friendly Match at Alberni Valley - Strategic Plan.pptxFIFA Friendly Match at Alberni Valley - Strategic Plan.pptx
FIFA Friendly Match at Alberni Valley - Strategic Plan.pptx
abuhasanjahangir
Isaiah Scudder Dealing with Stress.pptx
Isaiah Scudder  Dealing with Stress.pptxIsaiah Scudder  Dealing with Stress.pptx
Isaiah Scudder Dealing with Stress.pptx
FamilyWorshipCenterD
Satoshi Nakamoto - True Identity Revealed
Satoshi Nakamoto - True Identity RevealedSatoshi Nakamoto - True Identity Revealed
Satoshi Nakamoto - True Identity Revealed
Mike Hydes
RBC_Indices_Presentation_physiology (1).pptx
RBC_Indices_Presentation_physiology (1).pptxRBC_Indices_Presentation_physiology (1).pptx
RBC_Indices_Presentation_physiology (1).pptx
AyushSharma546188
It's a great presentation for everything
It's a great presentation for everythingIt's a great presentation for everything
It's a great presentation for everything
NonalynMagdagasang1
Science Communication beyond Journal Publications Workshop
Science Communication beyond Journal Publications WorkshopScience Communication beyond Journal Publications Workshop
Science Communication beyond Journal Publications Workshop
WAIHIGA K.MUTURI
Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)
Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)
Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)
RonHawkes1
Profisee - HIMSS workshop - Mar 2025 - final.pptx
Profisee - HIMSS workshop - Mar 2025 - final.pptxProfisee - HIMSS workshop - Mar 2025 - final.pptx
Profisee - HIMSS workshop - Mar 2025 - final.pptx
Profisee
How managers can learn how to lead with humour
How managers can learn how to lead with humourHow managers can learn how to lead with humour
How managers can learn how to lead with humour
analystcoupons8q
AAC2025_Baptiste Grand_Der OKR.X Guide.pdf
AAC2025_Baptiste Grand_Der OKR.X Guide.pdfAAC2025_Baptiste Grand_Der OKR.X Guide.pdf
AAC2025_Baptiste Grand_Der OKR.X Guide.pdf
Agile Austria Conference
Globibo Book Translation: Connect with Readers in Any Language
Globibo Book Translation: Connect with Readers in Any LanguageGlobibo Book Translation: Connect with Readers in Any Language
Globibo Book Translation: Connect with Readers in Any Language
globibo
AI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbv
AI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbvAI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbv
AI PRESENTATION.pptxfvvvvvvvvvvhbbbvvbbv
xdlmao561
Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...
Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...
Scotland's Museums and Galleries Strategy Symposium 2025 - climate action wor...
Museums Galleries Scotland
Johari window introduction to identifying the personality attributes
Johari window introduction to identifying the personality attributesJohari window introduction to identifying the personality attributes
Johari window introduction to identifying the personality attributes
yogitabharatmandhany
2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx
2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx
2025-03-02 FATC 01 Annas, Caiaphas & The Sanhedrin (shared slides).pptx
Dale Wells
Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - diversity workshop...
Museums Galleries Scotland
AAC2025_Danninger_Fail fast succeed smarter.pdf
AAC2025_Danninger_Fail fast succeed smarter.pdfAAC2025_Danninger_Fail fast succeed smarter.pdf
AAC2025_Danninger_Fail fast succeed smarter.pdf
Agile Austria Conference
Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...
Scotland's Museums and Galleries Strategy Symposium 2025 - education workshop...
Museums Galleries Scotland
the el filibusterismo reporting lesson 3.4
the el filibusterismo reporting lesson 3.4the el filibusterismo reporting lesson 3.4
the el filibusterismo reporting lesson 3.4
riverajaysongit446
AMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdf
AMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdfAMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdf
AMFI-Investor-Awareness-Presentation.4ee88c759f895e1be65d.pdf
sakivvikas86
FIFA Friendly Match at Alberni Valley - Strategic Plan.pptx
FIFA Friendly Match at Alberni Valley - Strategic Plan.pptxFIFA Friendly Match at Alberni Valley - Strategic Plan.pptx
FIFA Friendly Match at Alberni Valley - Strategic Plan.pptx
abuhasanjahangir
Isaiah Scudder Dealing with Stress.pptx
Isaiah Scudder  Dealing with Stress.pptxIsaiah Scudder  Dealing with Stress.pptx
Isaiah Scudder Dealing with Stress.pptx
FamilyWorshipCenterD
Satoshi Nakamoto - True Identity Revealed
Satoshi Nakamoto - True Identity RevealedSatoshi Nakamoto - True Identity Revealed
Satoshi Nakamoto - True Identity Revealed
Mike Hydes
RBC_Indices_Presentation_physiology (1).pptx
RBC_Indices_Presentation_physiology (1).pptxRBC_Indices_Presentation_physiology (1).pptx
RBC_Indices_Presentation_physiology (1).pptx
AyushSharma546188
It's a great presentation for everything
It's a great presentation for everythingIt's a great presentation for everything
It's a great presentation for everything
NonalynMagdagasang1
Science Communication beyond Journal Publications Workshop
Science Communication beyond Journal Publications WorkshopScience Communication beyond Journal Publications Workshop
Science Communication beyond Journal Publications Workshop
WAIHIGA K.MUTURI
Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)
Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)
Heraldry Gold's Whiteburn Gold Project (PDAC, March 2025)
RonHawkes1

Self-hosting Kafka at Scale: Netflix's Journey & Challenges

  • 1. Self-hosting Kafka at Scale Netflixs Journey & Challenges Piyush Goyal, Staff Engineer, Data Platform Nick Mahilani, Staff Engineer, Data Platform Current 2024
  • 2. Thank you for being here! RAISE YOUR HAND IF YOU USE KAFKA IN YOUR ORGANIZATION
  • 3. KEEP YOUR HAND UP IF YOU ARE SELF-HOSTING APACHE KAFKA (NOT using a Kafka service provider)
  • 4. WHAT CAN YOU EXPECT FROM THIS SESSION? How Netflix leverages Kafka to unlock various use-cases ? Our Long Journey with Kafka How we operate Kafka today ? Challenges and learnings
  • 5. Business Context Keystone Platform (2015) Evolution to Composable Architecture Kafka as a Service (2021) KaaS Features and Architecture KaaS Learnings Our Journey With Kafka
  • 7. Microservices Ecosystem Systems at our scale generate a lot of data This data needs to be transported to where it can be processed and analysed
  • 8. Centralized Event Pipeline (2015) The System should have the following characteristics: Easy to use Highly Available Scalable Near Real-Time
  • 9. Centralized Event Pipeline (2015) The System should have the following characteristics: Easy to use Highly Available Scalable Near Real-Time This gave rise to Netflixs Keystone Platform in 2015
  • 10. Business Context Keystone Platform (2015) Evolution to Composable Architecture Kafka as a Service (2021) KaaS Features and Architecture KaaS Learnings Our Journey With Kafka
  • 11. Keystone Platform (2015) Highly abstracted product Data Movement to Sinks Simple Real-time processing (Filter, Projection) Client Library, UI, Management plane, and Data Plane Used Apache Kafka and Apache Flink under the hood
  • 12. Keystone - User Interface
  • 13. Keystone Platform Event Producers Publish events with keystone client library (Kafka-agnostic)
  • 14. Keystone Platform Event Producers Keystone Management Publish events with keystone client library (Kafka-agnostic)
  • 19. FRONTING CONSUMER Multi-tenant clusters Used to publish data Abstracted from producers Controlled Cluster access Critical for High availability Larger Fleet Multi-tenant clusters Used to consume data Coupled with consumers Smaller Fleet Two types of Kafka Clusters
  • 20. Resilience to cluster failure Keystone Client Stream Cluster Topic playback_events Cluster A playback_events ad_events Cluster B ad_events Topic lookup Cluster A Cluster B Topic: playback_events Fronting Topic: ad_events
  • 21. Resilience to cluster failure Keystone Client Stream Cluster Topic playback_events Cluster A Cluster B playback_events ad_events Cluster B ad_events Topic lookup Cluster A Cluster B Topic: playback_events Fronting Topic: ad_events Topic: playback_events
  • 22. Things worked well.. Highly abstracted and easy to use product Only takes a couple minutes to create simple data pipelines Huge adoption - more than 6000 data pipelines >100M message per seconds (>150GB/s) Quick real-time transformations like filtering and projection
  • 23. Not everything worked well For Streaming-only consumers, It was highly inefficient Unnecessary hops Higher latency Extra Cost Noisy neighbors in a multi-tenanted environment No direct access to Kafka for producers Administration of Kafka was semi-automated
  • 24. And we needed more.. Highly abstracted product means limited functionality done well Solved 80% use-cases, what about the rest? New Business Requirements demanded more functionality Event Driven Architecture Change Data Capture Low latency use-cases Custom Stream Processing Direct Kafka integration for Third party tools
  • 25. Our Journey With Kafka: Business Context; Keystone Platform (2015); Evolution to Composable Architecture; Kafka as a Service (2021); KaaS Features and Architecture; KaaS Learnings
  • 26. Architecture Evolution: from a Closed System (Pipeline Abstraction) to a Composable System (Pipeline Abstraction, Kafka as a Service, Stream Processing)
  • 27. Whether to build or buy? We evaluated the tradeoffs for our situation (2020-21): customizability; long-term costs; available in-house expertise; minimizing risks. After careful consideration, we decided to BUILD our own managed Kafka platform. YMMV!
  • 28. Our Journey With Kafka: Business Context; Keystone Data Pipeline (2015); Evolution to Composable Architecture; Kafka as a Service (2021); KaaS Architecture; KaaS Learnings
  • 29. Kafka as a Service (KaaS): Alerting & Auto Remediation; Security & Access Control; Observability; Client Library; Schema Management; Provisioning
  • 32. Kafka Cluster Configuration. High-availability: replication factor = 2; min in-sync replicas = 1; unclean leader election enabled. Strong consistency: replication factor = 3; min in-sync replicas = 2; unclean leader election disabled.
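As a rough sketch, the two profiles on this slide can be written with the standard Apache Kafka setting names `default.replication.factor`, `min.insync.replicas`, and `unclean.leader.election.enable`. The slide does not say whether KaaS applies them at broker or topic level, so the broker-level form, the `PROFILES` table, and both helper functions are assumptions made for illustration.

```python
# Hypothetical profile table; the values come from the slide, the layout is assumed.
PROFILES = {
    "high-availability": {
        "default.replication.factor": 2,
        "min.insync.replicas": 1,
        "unclean.leader.election.enable": True,
    },
    "strong-consistency": {
        "default.replication.factor": 3,
        "min.insync.replicas": 2,
        "unclean.leader.election.enable": False,
    },
}


def render_properties(profile_name):
    """Render a profile as key=value lines in Java properties style."""
    lines = []
    for key, value in PROFILES[profile_name].items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines)


def replicas_that_can_fail(profile_name):
    """Replicas that may drop out of the ISR while acks=all writes still succeed."""
    profile = PROFILES[profile_name]
    return profile["default.replication.factor"] - profile["min.insync.replicas"]


print(render_properties("strong-consistency"))
```

Both profiles keep accepting `acks=all` writes with one replica out of sync. They differ in what happens once all in-sync replicas are gone: with unclean leader election enabled, an out-of-sync replica may become leader, trading possible data loss for availability.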
  • 36. Our Journey With Kafka: Business Context; Keystone Data Pipeline (2015); Evolution to Composable Architecture; Kafka as a Service (2021); KaaS Architecture; KaaS Learnings
  • 40. KaaS Scale: 190 million messages/second; 150+ GB ingested/second; 8+ PB persisted state; 475+ dedicated Kafka clusters; 11,500 Kafka brokers; 35,000 Kafka topics
  • 41. Our Journey With Kafka: Business Context; Keystone Data Pipeline (2015); Evolution to Composable Architecture; Kafka as a Service (2021); KaaS Architecture; KaaS Learnings
  • 42. 1. Scaling a single Kafka Cluster
  • 43. Scaling Up a Cluster before KaaS: topic partition counts were tightly coupled with the number of brokers
  • 44. Scaling Up a Cluster in KaaS: using OSS Cruise Control, topic partition counts are independent of the number of brokers
  • 45. 2. Making Cluster Upgrades Faster
  • 46. Upgrade Time vs. State Size
  • 48. Unit of Change: AWS EC2 instance
  • 49. Kafka Fleet Upgrades (upgrade time with old strategy / desired upgrade time / upgrade frequency). Hardware upgrade: 3+ months / < 1 month / annually. Software upgrade: 3+ months / < 1 week / monthly.
  • 50. Software Upgrade Strategy #1: Leverage Amazon Elastic Block Store (EBS). Source: https://aws.amazon.com/ebs/
  • 51. Software Upgrade Strategy #1: Move Kafka state from local instance storage to EBS
  • 52. EBS is awesome, but: EBS is expensive at large scale; we moved large-scale clusters back to AWS instance types with local disk; back to where we started, with longer upgrade times.
  • 53. How can we upgrade faster without EBS?
  • 54. How can we upgrade faster without EBS? AWS Replace Root Volume. Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/replace-root.html
  • 55. Software Upgrade Strategy #2: AWS Replace Root Volume to upgrade AMI
  • 56. Kafka Fleet Upgrades with Replace Root Volume Strategy (current upgrade time / desired upgrade time / upgrade frequency). Hardware upgrade: 1+ month / < 1 month / annually. Software upgrade: 5 days / < 1 week / monthly.
  • 58. Right Sizing a Kafka Cluster. Inputs: number of consumers; throughput; replication factor; retention. Questions: Which EC2 instance type? How many instances? How much disk?
  • 59. Right Sizing a Kafka Cluster. Inputs (number of consumers, throughput, replication factor, retention) feed the Kafka Capacity Model, which outputs candidate configurations (num brokers / instance type / cost): 3 / i3en.2xl / $; 3 / i4i.2xl / $$; 6 / r5.4xl + EBS / $$$. Source: https://github.com/Netflix-Skunkworks/service-capacity-modeling/blob/main/service_capacity_modeling/models/org/netflix/kafka.py
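To make the inputs on these two slides concrete, here is a deliberately simplified sizing sketch. It is not the Netflix capacity model linked above: the formulas, the `Workload` and `BrokerShape` types, the 60% headroom, the three-broker minimum, and the example broker shape are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    write_mb_per_s: float      # producer throughput into the cluster
    retention_hours: float
    replication_factor: int
    num_consumers: int         # consumer groups that each read every byte


@dataclass(frozen=True)
class BrokerShape:
    name: str                  # illustrative label, not a real instance spec
    usable_disk_gb: float
    network_mb_per_s: float


def required_disk_gb(w):
    # Every written byte is stored replication_factor times for the retention window.
    return w.write_mb_per_s * w.retention_hours * 3600 * w.replication_factor / 1000


def required_egress_mb_per_s(w):
    # Each byte is read once per consumer and once per follower replica.
    return w.write_mb_per_s * (w.num_consumers + w.replication_factor - 1)


def brokers_needed(w, shape, headroom=0.6, min_brokers=3):
    # Size for the tighter of disk and network, using only a fraction of each.
    by_disk = math.ceil(required_disk_gb(w) / (shape.usable_disk_gb * headroom))
    by_net = math.ceil(required_egress_mb_per_s(w) / (shape.network_mb_per_s * headroom))
    return max(min_brokers, by_disk, by_net)


workload = Workload(write_mb_per_s=100, retention_hours=8,
                    replication_factor=3, num_consumers=2)
shape = BrokerShape(name="example-local-disk", usable_disk_gb=5000,
                    network_mb_per_s=1000)
print(brokers_needed(workload, shape))
```

Running the same workload against several candidate shapes and attaching a price to each would give a comparison table like the one on the slide; the real model also has to account for factors this sketch ignores, such as CPU, page cache, and burst traffic.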
  • 60. Our Journey With Kafka: Business Context; Keystone Data Pipeline (2015); Evolution to Composable Architecture; Kafka as a Service (2021); KaaS Features and Architecture; KaaS Learnings
  • 61. Key Takeaway: composable architectures are easier to scale and evolve with the business. Closed System (Pipeline Abstraction) versus Composable System (Pipeline Abstraction, Kafka as a Service, Stream Processing).
  • 62. Q & A. Self-hosting Kafka at Scale: Netflix's Journey & Challenges. Piyush Goyal, Nick Mahilani
  • 63. References: S3 Flash Bootloader (precursor to AWS Replace Root Volume); Joey's talk on Capacity Plan optimally in the cloud; Kyle and JS talk on Iterating faster on Stateful Services in the cloud