際際滷

際際滷Share a Scribd company logo
How We Built an Event-Time
Merge of Two Kafka-Streams
with Spark Streaming
Ralf Sigmund, Sebastian Schr旦der
Otto GmbH & Co. KG
Agenda
 Who are we?
 Our Setup
 The Problem
Our Approach
 Requirements
 Advanced Requirements
 Lessons learned
Who are we?
 Ralf
Technical Designer Team Tracking
Cycling, Surfing
Twitter: @sistar_hh
 Sebastian
Developer Team Tracking
Taekwondo, Cycling
Twitter: @Sebasti0n
Who are we working for?
 2nd largest ecommerce company in Germany
 more than 1 million visitors every day
 our blog: https://dev.otto.de
 our jobs: www.otto.de/jobs
Our Setup
otto.de
Varnish Reverse
Proxy
Tracking-Server
Vertical 1Vertical 0
Vertical n
Service 0
Service 1
Service n
The Problem
otto.de
Varnish Reverse
Proxy
Vertical 1Vertical 0
Vertical n
Service 0
Service 1
Service n
Has a limit on the
size of messages
Needs to send
large tracking
messages
Tracking-Server
Our Approach
otto.de
Varnish Reverse
Proxy
Tracking-Server
Vertical 1Vertical 0
Vertical n
S0
S1
Sn
Spark
Service
Requirements
Messages with the same key are
merged
t: 1
id: 1
t: 3
id: 2
t: 2
id: 3
t: 0
id: 2
t: 4
id: 0
t: 2
id: 3
t: 1
id: 1
t: 0
id: 2
t: 3
id: 2
t: 2
id: 3
t: 2
id: 3 t: 4
id: 0
Messages have to be timed out
t: 1
id: 1
t: 3
id: 2
t: 2
id: 3
t: 0
id: 2
t: 4
id: 0
t: 2
id: 3
t: 1
id: 1
t: 0
id: 2
t: 3
id: 2
t: 2
id: 3
t: 2
id: 3 t: 4
id: 0
max(business event timestamp)
time-out after 3 sec
Event Streams might get stuck
t: 1
id: 1
t: 3
id: 2
t: 2
id: 3
t: 0
id: 2
t: 4
id: 0
t: 2
id: 3
t: 1
id: 1
t: 0
id: 2
t: 3
id: 2
t: 2
id: 3
t: 2
id: 3 t: 4
id: 0
clock A
clock B
clock: min(clock A, clock B)
There might be no messages
t: 1
id: 1
t: 4
id: 2
t: 2
id: 3
t: 5
heart
beat
t: 1
id: 1
t: 4
id: 2
t: 2
id: 3
time-out after 3 sec
t: 5
heart
beat
clock: min(clock A, clock B) : 4
Requirements Summary
 Messages with the same key are merged
 Messages are timed out by the combined event
time of both source topics
 Timed out messages are sorted by event time
Solution
Merge messages with the same Id
UpdateStateByKey, which is defined on pairwise
RDDs
It applies a function to all new messages and
existing state for a given key
Returns a StateDStream with the resulting
messages
UpdateStateByKey
1
2
3
1
2
3
1
2
3
UpdateFunction:
(Seq[InputMessage], Option[MergedMessage]) =>
Option[MergedMessage]
Merge messages with the same Id
Merge messages with the same Id
Flag And Remove Timed Out Msgs
1
MM
TO
2
2
3
MM
MM
TO
MM
TO
2
3
1 MM
1 MM
3
needs global
Business
Clock
UpdateFunction
Messages are timed out by event time
We need access to all messages from the
current micro batch and the history (state) =>
Custom StateDStream
Compute the maximum event time of both topics
and supply it to the updateStateByKey function
Messages are timed out by event time
Messages are timed out by event time
Messages are timed out by event time
Messages are timed out by event time
Advanced Requirements
topicBtopicA Effect of different request rates in
source topics
10K
5K
0
10K
5K
0
catching up
with 5K per
microbatch
data read too
early
Handle different request rates in topics
Stop reading from one topic if the event time of
other topic is too far behind
Store latest event time of each topic and
partition on driver
Use custom KafkaDStream with clamp method
which uses the event time
Handle different request rates in topics
Handle different request rates in topics
Handle different request rates in topics
Handle different request rates in topics
Guarantee at least once semantics
o: 1
t: a
o: 7
t: a
o: 2
t: a
o: 5
t: b
o: 9
t: b
t: 6
t: b
Current State Sink Topic
a-offset: 1
b-offset: 5
a-offset: 1
b-offset: 5
a-offset: 1
b-offset: 5
On Restart/Recovery
Retrieve last message
Timeout
Spark Merge Service
Guarantee at least once semantics
Guarantee at least once semantics
Everything put together
Lessons learned
 Excellent Performance and Scalability
 Extensible via Custom RDDs and Extension to
DirectKafka
No event time windows in Spark Streaming
See: https://github.com/apache/spark/pull/2633
 Checkpointing cannot be used when deploying new
artifact versions
 Driver/executor model is powerful, but also complex
THANK YOU.
Twitter:
@sistar_hh
@Sebasti0n
Blog: dev.otto.de
Jobs: otto.de/jobs
Ad

Recommended

We Want You Back Automation
We Want You Back Automation
Kevin Raffay
Processing messages in a sqs with lambda function
Processing messages in a sqs with lambda function
Subhamay Bhattacharyya
Observer, a "real life" time series application
Observer, a "real life" time series application
K辿vin LOVATO
Infrastructure as code with troposphere on aws in 5 min
Infrastructure as code with troposphere on aws in 5 min
Mathias M奪rtens
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na...
HostedbyConfluent
Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014
K辿vin LOVATO
Distributed Kafka Architecture Taboola Scale
Distributed Kafka Architecture Taboola Scale
Apache Kafka TLV
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
Monal Daxini
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
Databricks
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
C4Media
Confluent-Ably-AWS-ID-2023 - G際際滷.pptx
Confluent-Ably-AWS-ID-2023 - G際際滷.pptx
Ahmed791434
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Accentfuture
An Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
Streaming Analytics
Streaming Analytics
Neera Agarwal
Streaming analytics better than batch when and why by Dawid Wysakowicz and ...
Streaming analytics better than batch when and why by Dawid Wysakowicz and ...
Big Data Spain
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
nilanjan172nsvian
Streaming analytics better than batch when and why - (Big Data Tech 2017)
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Soroosh Khodami
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Accentfuture
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
ucelebi
Decipher SEO Solutions for your startup needs.
Decipher SEO Solutions for your startup needs.
mathai2
Streamlining CI/CD with FME Flow: A Practical Guide
Streamlining CI/CD with FME Flow: A Practical Guide
Safe Software

More Related Content

Similar to How we built an event-time merge of two kafka-streams with spark-streaming (20)

Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
Monal Daxini
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
Databricks
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
C4Media
Confluent-Ably-AWS-ID-2023 - G際際滷.pptx
Confluent-Ably-AWS-ID-2023 - G際際滷.pptx
Ahmed791434
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Accentfuture
An Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
Streaming Analytics
Streaming Analytics
Neera Agarwal
Streaming analytics better than batch when and why by Dawid Wysakowicz and ...
Streaming analytics better than batch when and why by Dawid Wysakowicz and ...
Big Data Spain
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
nilanjan172nsvian
Streaming analytics better than batch when and why - (Big Data Tech 2017)
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Soroosh Khodami
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Accentfuture
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
ucelebi
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
Monal Daxini
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin...
Databricks
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
C4Media
Confluent-Ably-AWS-ID-2023 - G際際滷.pptx
Confluent-Ably-AWS-ID-2023 - G際際滷.pptx
Ahmed791434
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Accentfuture
An Introduction to time series with Team Apache
An Introduction to time series with Team Apache
Patrick McFadin
Streaming Analytics
Streaming Analytics
Neera Agarwal
Streaming analytics better than batch when and why by Dawid Wysakowicz and ...
Streaming analytics better than batch when and why by Dawid Wysakowicz and ...
Big Data Spain
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark.pdf
nilanjan172nsvian
Streaming analytics better than batch when and why - (Big Data Tech 2017)
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Why And When Should We Consider Stream Processing In Our Solutions Teqnation ...
Soroosh Khodami
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Learn Apache Kafka Online | Comprehensive Kafka Course & Training
Accentfuture
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
ucelebi

Recently uploaded (20)

Decipher SEO Solutions for your startup needs.
Decipher SEO Solutions for your startup needs.
mathai2
Streamlining CI/CD with FME Flow: A Practical Guide
Streamlining CI/CD with FME Flow: A Practical Guide
Safe Software
AI for PV: Development and Governance for a Regulated Industry
AI for PV: Development and Governance for a Regulated Industry
Biologit
Best Software Development at Best Prices
Best Software Development at Best Prices
softechies7
Why Edge Computing Matters in Mobile Application Tech.pdf
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
Azure AI Foundry: The AI app and agent factory
Azure AI Foundry: The AI app and agent factory
Maxim Salnikov
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
declaration of Variables and constants.pptx
declaration of Variables and constants.pptx
meemee7378
Complete WordPress Programming Guidance Book
Complete WordPress Programming Guidance Book
Shabista Imam
From Data Preparation to Inference: How Alluxio Speeds Up AI
From Data Preparation to Inference: How Alluxio Speeds Up AI
Alluxio, Inc.
HYBRIDIZATION OF ALKANES AND ALKENES ...
HYBRIDIZATION OF ALKANES AND ALKENES ...
karishmaduhijod1
Sysinfo OST to PST Converter Infographic
Sysinfo OST to PST Converter Infographic
SysInfo Tools
Test Case Design Techniques Practical Examples & Best Practices in Software...
Test Case Design Techniques Practical Examples & Best Practices in Software...
Muhammad Fahad Bashir
Sap basis role in public cloud in s/4hana.pptx
Sap basis role in public cloud in s/4hana.pptx
htmlprogrammer987
Modern Platform Engineering with Choreo - The AI-Native Internal Developer Pl...
Modern Platform Engineering with Choreo - The AI-Native Internal Developer Pl...
WSO2
Which Hiring Management Tools Offer the Best ROI?
Which Hiring Management Tools Offer the Best ROI?
HireME
Humans vs AI Call Agents - Qcall.ai's Special Report
Humans vs AI Call Agents - Qcall.ai's Special Report
Udit Goenka
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
Building Geospatial Data Warehouse for GIS by GIS with FME
Building Geospatial Data Warehouse for GIS by GIS with FME
Safe Software
Simplify Insurance Regulations with Compliance Management Software
Simplify Insurance Regulations with Compliance Management Software
Insurance Tech Services
Decipher SEO Solutions for your startup needs.
Decipher SEO Solutions for your startup needs.
mathai2
Streamlining CI/CD with FME Flow: A Practical Guide
Streamlining CI/CD with FME Flow: A Practical Guide
Safe Software
AI for PV: Development and Governance for a Regulated Industry
AI for PV: Development and Governance for a Regulated Industry
Biologit
Best Software Development at Best Prices
Best Software Development at Best Prices
softechies7
Why Edge Computing Matters in Mobile Application Tech.pdf
Why Edge Computing Matters in Mobile Application Tech.pdf
IMG Global Infotech
Azure AI Foundry: The AI app and agent factory
Azure AI Foundry: The AI app and agent factory
Maxim Salnikov
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
declaration of Variables and constants.pptx
declaration of Variables and constants.pptx
meemee7378
Complete WordPress Programming Guidance Book
Complete WordPress Programming Guidance Book
Shabista Imam
From Data Preparation to Inference: How Alluxio Speeds Up AI
From Data Preparation to Inference: How Alluxio Speeds Up AI
Alluxio, Inc.
HYBRIDIZATION OF ALKANES AND ALKENES ...
HYBRIDIZATION OF ALKANES AND ALKENES ...
karishmaduhijod1
Sysinfo OST to PST Converter Infographic
Sysinfo OST to PST Converter Infographic
SysInfo Tools
Test Case Design Techniques Practical Examples & Best Practices in Software...
Test Case Design Techniques Practical Examples & Best Practices in Software...
Muhammad Fahad Bashir
Sap basis role in public cloud in s/4hana.pptx
Sap basis role in public cloud in s/4hana.pptx
htmlprogrammer987
Modern Platform Engineering with Choreo - The AI-Native Internal Developer Pl...
Modern Platform Engineering with Choreo - The AI-Native Internal Developer Pl...
WSO2
Which Hiring Management Tools Offer the Best ROI?
Which Hiring Management Tools Offer the Best ROI?
HireME
Humans vs AI Call Agents - Qcall.ai's Special Report
Humans vs AI Call Agents - Qcall.ai's Special Report
Udit Goenka
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
CodeCleaner: Mitigating Data Contamination for LLM Benchmarking
arabelatso
Building Geospatial Data Warehouse for GIS by GIS with FME
Building Geospatial Data Warehouse for GIS by GIS with FME
Safe Software
Simplify Insurance Regulations with Compliance Management Software
Simplify Insurance Regulations with Compliance Management Software
Insurance Tech Services
Ad

How we built an event-time merge of two kafka-streams with spark-streaming