Ricardo Fanjul, letgo
Designing a Horizontally Scalable Event-Driven Big Data Architecture with Apache Spark
#SAISExp2
2018: DATA ODYSSEY
Ricardo Fanjul
Data Engineer
Founded in 2015
100MM+ downloads
400MM+ listings
#SAISExp2
LETGO DATA PLATFORM IN NUMBERS
500GB data daily
1 billion events processed daily
50K peak events per second
600+ event types
200TB storage (S3)
< 1 sec NRT processing time
#SAISExp2
THE DAWN OF LETGO
#SAISExp2
CLASSICAL BI PLATFORM
THE DAWN OF LETGO
#SAISExp2
CLASSICAL BI PLATFORM
THE DAWN OF LETGO
#SAISExp2
MOVING TO µ-SERVICES AND EVENTS
THE DAWN OF LETGO
#SAISExp2
MOVING TO µ-SERVICES AND EVENTS
THE DAWN OF LETGO
Domain Events · Tracking Events
#SAISExp2
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
INGEST
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
OUR GOAL
DATA INGESTION
#SAISExp2
THE DISCOVERY
DATA INGESTION
#SAISExp2
KAFKA CONNECT
DATA INGESTION
Amazon Aurora
#SAISExp2
THE JOURNEY BEGINS
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
BUILDING THE DATA LAKE
STORAGE
#SAISExp2
BUILDING THE DATA LAKE
STORAGE
We want to store all events coming from Kafka to S3.
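A minimal Structured Streaming sketch of that goal (broker, topic, and bucket names are assumptions, not letgo's actual values); the next slides show why the naive version is not enough:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("events-to-s3").getOrCreate()

// Read every event from Kafka as a raw JSON string.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")   // assumed broker address
  .option("subscribe", "events")                     // assumed topic name
  .load()
  .selectExpr("CAST(value AS STRING) AS json", "timestamp")

// Land the raw events in S3 as Parquet, checkpointing the stream's progress.
events.writeStream
  .format("parquet")
  .option("path", "s3a://data-lake/events/")         // assumed bucket
  .option("checkpointLocation", "s3a://data-lake/checkpoints/events/")
  .start()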
#SAISExp2
BUILDING THE DATA LAKE
STORAGE
#SAISExp2
SOMETIMES... SHIT HAPPENS
STORAGE
#SAISExp2
DUPLICATED EVENTS
STORAGE
#SAISExp2
DUPLICATED EVENTS
STORAGE
#SAISExp2
(VERY) LATE EVENTS
STORAGE
#SAISExp2
(VERY) LATE EVENTS
STORAGE
Dirty Buckets:
1. Read a batch of events from Kafka.
2. Write each event to Cassandra.
3. Write the dirty hours to a compacted topic, keyed by (event_type, hour). (Steps 1-3 are sketched below.)
4. Read the dirty-hours topic.
5. Read all events belonging to the dirty hours from Cassandra.
6. Store them in S3.
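A hedged streaming-side sketch of steps 1-3 (keyspace, table, broker, and topic names are assumptions); events is the Kafka stream from the earlier sketch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

events.writeStream.foreachBatch { (batch: DataFrame, batchId: Long) =>
  // Step 2: write every event of the micro-batch to Cassandra (spark-cassandra-connector).
  batch.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "datalake", "table" -> "events"))   // assumed names
    .mode("append")
    .save()

  // Step 3: flag the (event_type, hour) buckets this batch touched as dirty
  // on a compacted topic, so the batch job knows which hours to re-export to S3.
  batch
    .select(get_json_object(col("json"), "$.type").as("event_type"),
            date_trunc("hour", col("timestamp")).as("hour"))
    .distinct()
    .selectExpr("concat(event_type, '|', CAST(hour AS STRING)) AS key", "'dirty' AS value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")               // assumed broker
    .option("topic", "dirty-hours")                                // assumed topic
    .save()
}.start()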
#SAISExp2
S3 PROBLEMS
STORAGE
#SAISExp2
S3 PROBLEMS
STORAGE
SOME S3 BIG DATA PROBLEMS:
1. Eventual consistency
2. Very slow renames
S3 PROBLEMS: EVENTUAL CONSISTENCY
STORAGE
#SAISExp2
S3 PROBLEMS: EVENTUAL CONSISTENCY
STORAGE
S3GUARD
S3AFileSystem: the Hadoop FileSystem for Amazon S3.
FileSystem operations fan out to two clients: the write path writes the object data to S3 and the filesystem metadata to DynamoDB; the read path reads the metadata and object listings from DynamoDB before fetching the object data from S3, so listings stay consistent.
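S3Guard is switched on through the S3A Hadoop configuration; a minimal sketch from Spark (the DynamoDB table name and region are assumptions):

val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Keep S3A filesystem metadata in DynamoDB so listings are consistent.
hadoopConf.set("fs.s3a.metadatastore.impl",
  "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
hadoopConf.set("fs.s3a.s3guard.ddb.table", "s3guard-metadata")   // assumed table name
hadoopConf.set("fs.s3a.s3guard.ddb.region", "us-east-1")         // assumed region
hadoopConf.set("fs.s3a.s3guard.ddb.table.create", "true")        // create the table if missing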
S3 PROBLEMS: SLOW RENAMES
STORAGE
Job freeze?
#SAISExp2
S3 PROBLEMS: SLOW RENAMES
STORAGE
New Hadoop 3.1 S3A committers (a hedged Spark config sketch follows):
• Directory
• Partitioned
• Magic
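A hedged sketch of wiring the directory committer into a Spark session; it assumes the spark-hadoop-cloud module is on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("s3a-committer-example")
  // Route Spark's commit protocol through the S3A committers instead of rename-based commits.
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter")
  // Pick the committer: "directory", "partitioned" or "magic".
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  .getOrCreate()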
#SAISExp2
PROCESS
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
REAL-TIME USER SEGMENTATION
STREAM
Stream · Journal · User buckets changed
#SAISExp2
REAL-TIME USER SEGMENTATION
STREAM
JOURNAL · STREAM · User buckets changed
#SAISExp2
REAL-TIME PATTERN DETECTION
STREAM
Is it still available?
What condition is it in?
Could we meet at...?
Is the price negotiable?
I offer you... $
#SAISExp2
REAL-TIME PATTERN DETECTION
STREAM
{
  "type": "meeting_proposal",
  "properties": {
    "location_name": "Letgo HQ",
    "geo": {
      "lat": "41.390205",
      "lon": "2.154007"
    },
    "date": "1511193820350",
    "meeting_id": "23213213213"
  }
}
Structured data
#SAISExp2
REAL-TIME PATTERN DETECTION
STREAM
Meeting proposed + meeting accepted = emit accepted-meeting event
Meeting proposed + nothing in X time = "You have a proposal to meet"
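One way to express those two rules is Structured Streaming's mapGroupsWithState with a processing-time timeout; a minimal sketch (case classes and the timeout value are assumptions, not necessarily how letgo implements it):

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class MeetingEvent(meetingId: String, eventType: String, ts: Timestamp)   // assumed shape
case class MeetingOutcome(meetingId: String, outcome: String)                   // assumed shape

def detect(meetingId: String,
           events: Iterator[MeetingEvent],
           state: GroupState[Boolean]): MeetingOutcome = {
  if (state.hasTimedOut) {                         // proposal seen, nothing arrived in X time
    state.remove()
    MeetingOutcome(meetingId, "proposal_reminder")
  } else if (events.exists(_.eventType == "meeting_accepted")) {
    state.remove()
    MeetingOutcome(meetingId, "meeting_accepted")  // proposed + accepted
  } else {
    state.update(true)                             // remember the pending proposal
    state.setTimeoutDuration("30 minutes")         // "X time" is an assumption
    MeetingOutcome(meetingId, "pending")
  }
}

// meetingEvents: Dataset[MeetingEvent] parsed from the chat topic (not shown)
// meetingEvents.groupByKey(_.meetingId)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(detect)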
#SAISExp2
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
GEODATA ENRICHMENT
BATCH
#SAISExp2
GEODATA ENRICHMENT
BATCH
{
  "data": {
    "id": "105dg3272-8e5f-426f-bca0-704e98552961",
    "type": "some_event",
    "attributes": {
      "latitude": 42.3677203,
      "longitude": -83.1186093
    }
  },
  "meta": {
    "created_at": 1522886400036
  }
}
Technically correct, but not very actionable.
#SAISExp2
GEODATA ENRICHMENT
BATCH
What we know: (42.3677203, -83.1186093)
City: Detroit
Postal code: 48206
State: Michigan
DMA: Detroit
Country: US
#SAISExp2
GEODATA ENRICHMENT
BATCH
How we do it:
• Populating JTS indices from WKT polygon data
• Custom Spark SQL UDF
SELECT geodata.dma_name,
       geodata.dma_number AS dma_number,
       geodata.city AS city,
       geodata.state AS state,
       geodata.zip_code AS zip_code
FROM (
  SELECT geodata(longitude, latitude) AS geodata
  FROM ...
)
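A hedged sketch of that approach (class names and reference-data loading are assumptions): build a JTS STRtree over the WKT polygons, then register a point-in-polygon lookup as the geodata UDF.

import org.locationtech.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.locationtech.jts.index.strtree.STRtree
import org.locationtech.jts.io.WKTReader
import org.apache.spark.sql.SparkSession

case class GeoData(city: String, state: String, zip_code: String,
                   dma_name: String, dma_number: String)              // assumed shape

val spark = SparkSession.builder.getOrCreate()

// (WKT polygon, enrichment attributes) pairs; in practice loaded from the reference dataset.
val regions: Seq[(String, GeoData)] = Seq.empty                        // placeholder

val factory = new GeometryFactory()
val reader  = new WKTReader(factory)
val index   = new STRtree()
regions.foreach { case (wkt, geo) =>
  val polygon = reader.read(wkt)
  index.insert(polygon.getEnvelopeInternal, (polygon, geo))
}

// geodata(longitude, latitude): bounding-box candidates from the index, then exact containment.
spark.udf.register("geodata", (lon: Double, lat: Double) => {
  val point = factory.createPoint(new Coordinate(lon, lat))
  index.query(point.getEnvelopeInternal).toArray
    .map(_.asInstanceOf[(Geometry, GeoData)])
    .find { case (polygon, _) => polygon.contains(point) }
    .map(_._2)
    .orNull
})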
#SAISExp2
DISCOVER
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
QUERYING DATA
QUERY
Stream · Journal · User buckets changed
#SAISExp2
QUERYING DATA
QUERY
Stream · Journal · User buckets changed
#SAISExp2
QUERYING DATA
QUERY
Stream · Journal · User buckets changed
Amazon Aurora
QUERYING DATA
QUERY
METASTORE
SPECTRUM
QUERYING DATA
QUERY
#SAISExp2
QUERYING DATA
QUERY
METASTORE
Thrift Server
Amazon Aurora
QUERYING DATA
QUERY
CREATE TABLE IF NOT EXISTS database_name.table_name(
  some_column STRING,
  ...
  dt DATE
)
USING json
PARTITIONED BY (`dt`)

CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
  table "table_name",
  keyspace "keyspace_name")

CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name(
  some_column STRING...,
  dt DATE
)
PARTITIONED BY (`dt`)
USING PARQUET
LOCATION 's3a://bucket-name/database_name/table_name'

CREATE TABLE IF NOT EXISTS database_name.table_name
using com.databricks.spark.redshift
options (
  dbtable 'schema.redshift_table_name',
  tempdir 's3a://redshift-temp/',
  url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx',
  forward_spark_s3_credentials 'true')
QUERYING DATA
QUERY
CREATE TABLE ... USING [parquet, json, csv] instead of CREATE TABLE ... STORED AS:
70% higher performance!
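For illustration, the two flavours side by side (table and column names are assumptions; assumes an existing SparkSession named spark with Hive support enabled): the first goes through the Hive SerDe path, the second through Spark's native Parquet data source.

// Hive-compatible table: read and written through the Hive SerDe.
spark.sql("""
  CREATE TABLE IF NOT EXISTS db.events_hive (user_id STRING, dt DATE)
  STORED AS PARQUET
""")

// Native data source table: read and written through Spark's built-in Parquet reader.
spark.sql("""
  CREATE TABLE IF NOT EXISTS db.events_native (user_id STRING, dt DATE)
  USING PARQUET
""")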
QUERYING DATA: BATCHES WITH SQL
QUERY
1. Creating the table
2. Inserting data
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
1. Creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name(
user_id STRING,
column_b STRING,
...
)
USING PARQUET
PARTITIONED BY (`dt` STRING)
LOCATION 's3a://example/some_table'
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
2. Inserting data:
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
Problem?
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
• 200 output files, because of the default value of spark.sql.shuffle.partitions
QUERYING DATA: BATCHES WITH SQL
QUERY
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
?
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
DISTRIBUTE BY (dt): only one file per partition, not sorted.
CLUSTER BY (dt, user_id, column_b): multiple files.
DISTRIBUTE BY (dt) SORT BY (user_id, column_b): only one file per partition, sorted by user_id and column_b. Good for joins on these columns.
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)
#SAISExp2
QUERYING DATA: BATCHES WITH SQL
QUERY
#SAISExp2
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
OUR ANALYTICAL STACK
DATA EXPLOITATION
#SAISExp2
OUR ANALYTICAL STACK
DATA EXPLOITATION
Amazon Aurora
METASTORE
Thrift Server
DATA SCIENTISTS TEAM?
DATA EXPLOITATION
#SAISExp2
DATA SCIENTISTS AS I SEE THEM
DATA EXPLOITATION
#SAISExp2
DATA SCIENTISTS' SINS
DATA EXPLOITATION
#SAISExp2
DATA SCIENTISTS' SINS
DATA EXPLOITATION
Too many small files!
#SAISExp2
DATA SCIENTISTS' SINS
DATA EXPLOITATION
Huge Query!
#SAISExp2
DATA SCIENTISTS' SINS
DATA EXPLOITATION
Too much shuffle!
#SAISExp2
INGEST: Data Ingestion, Storage
PROCESS: Stream, Batch
DISCOVER: Query, Data exploitation, Orchestration
#SAISExp2
AIRFLOW
ORCHESTRATION
#SAISExp2
AIRFLOW
ORCHESTRATION
#SAISExp2
AIRFLOW
ORCHESTRATION
I'm happy!!
#SAISExp2
MOVING TO STATELESS CLUSTER
I'M SORRY, RICARDO. I'M AFRAID I CAN'T DO THAT.
PLATFORM LIMITATIONS
MOVING TO STATELESS CLUSTER
#SAISExp2
PLANNING THE SOLUTION
MOVING TO STATELESS CLUSTER
#SAISExp2
PLANNING THE SOLUTION
MOVING TO STATELESS CLUSTER
#SAISExp2
PLANNING THE SOLUTION
MOVING TO STATELESS CLUSTER
#SAISExp2
A LONG AND WINDING PROCESS
MOVING TO STATELESS CLUSTER
#SAISExp2
A LONG AND WINDING PROCESS
MOVING TO STATELESS CLUSTER
#SAISExp2
NEW CAPABILITIES OF THE PLATFORM
MOVING TO STATELESS CLUSTER
#SAISExp2
NEW CAPABILITIES OF THE PLATFORM
MOVING TO STATELESS CLUSTER
#SAISExp2
JUPITER AND BEYOND THE INFINITE
SOMETHING WONDERFUL WILL HAPPEN
JUPITER AND BEYOND THE INFINITE
#SAISExp2
REVEAL
JUPITER AND BEYOND THE INFINITE
https://youtu.be/CInMDMuSFwc
"The only way to discover the limits of the possible is to go beyond them into the impossible."
ARTHUR C. CLARKE
#SAISExp2
DO YOU WANT TO JOIN US?
THE FUTURE...?
https://boards.greenhouse.io/letgo