This document discusses issues and solutions for optimizing Spark jobs on Amazon Web Services (AWS). It covers dynamic resource allocation, partition pruning, file listing optimizations, output committer improvements, and iterative job performance. Specific problems include slow full table scans on partitioned Hive tables, YARN locality issues, output committer failures, and Hive insert overwrite inconsistencies. The proposed techniques include pushing partition predicates down to the Hive metastore, listing partitions in parallel, an S3 output committer that writes to local disk first, and the batchid pattern.
24. Predicate pushdown
Case | Behavior
Predicates with partition cols on partitioned table | Single partition scan
Predicates with partition and non-partition cols on partitioned table | Single partition scan
No predicate on partitioned table, e.g. sqlContext.table(nccp_log).take(10) | Full scan
No predicate on non-partitioned table | Single partition scan
25. Predicate pushdown for metadata
[Diagram: Parser -> Analyzer -> Optimizer -> SparkPlanner. In the Analyzer, ResolveRelation goes through HiveMetastoreCatalog, which calls getAllPartitions().]
What if your table has 1.6M partitions?
26. SPARK-6910
• Symptom
  • Querying against heavily partitioned Hive table is slow
• Cause
  • Predicates are not pushed down into Hive metastore, so Spark does full scan for table metadata
• Solution
  • Push down binary comparison expressions via getPartitionsByFilter() into Hive metastore
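The pushdown idea can be illustrated with a minimal sketch, assuming predicates arrive as simple (column, operator, literal) tuples. The function name, the tuple shape, and the column names are hypothetical and are not Spark's internal code; the sketch only shows how binary comparisons on partition columns could be separated out and turned into a filter string for the metastore, while everything else stays with Spark.

```python
# Sketch only: names and data shapes are assumptions, not Spark internals.
SUPPORTED_OPS = {"=", "<", "<=", ">", ">="}


def to_metastore_filter(predicates, partition_cols):
    """Split predicates into a metastore filter string and a residual list.

    predicates: list of (column, operator, literal) tuples.
    Only binary comparisons on partition columns are pushed down; the rest
    is left for Spark to evaluate after the scan.
    """
    pushed, residual = [], []
    for col, op, value in predicates:
        if col in partition_cols and op in SUPPORTED_OPS:
            # Quote string literals; leave numeric literals bare.
            literal = f'"{value}"' if isinstance(value, str) else str(value)
            pushed.append(f"{col} {op} {literal}")
        else:
            residual.append((col, op, value))
    return " and ".join(pushed), residual


# Hypothetical table partitioned by dateint and hour.
preds = [("dateint", ">=", 20150801), ("hour", "=", 3), ("country", "=", "US")]
filter_str, rest = to_metastore_filter(preds, partition_cols={"dateint", "hour"})
print(filter_str)  # dateint >= 20150801 and hour = 3
print(rest)        # [('country', '=', 'US')]
```

With a filter like this, the metastore returns only the matching partitions instead of every partition of the table.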
27. Predicate pushdown for metadata
[Diagram: Parser -> Analyzer -> Optimizer -> SparkPlanner. In the SparkPlanner, HiveTableScans produces a HiveTableScan, which calls getPartitionsByFilter().]
29. Input split computation
• mapreduce.input.fileinputformat.list-status.num-threads
  • The number of threads to use to list and fetch block locations for the specified input paths.
• Setting this property in Spark jobs doesn't help
30. File listing for partitioned table
[Diagram: each partition path becomes a HadoopRDD in a Seq[RDD]; each HadoopRDD lists its own input dir through S3N.]
Sequentially listing input dirs via S3N file system.
31. SPARK-9926, SPARK-10340
• Symptom
  • Input split computation for partitioned Hive table on S3 is slow
• Cause
  • Listing files on a per partition basis is slow
  • S3N file system computes data locality hints
• Solution
  • Bulk list partitions in parallel using AmazonS3Client
  • Bypass data locality computation for S3 objects
32. S3 bulk listing
[Diagram: each partition path becomes a HadoopRDD in a ParArray[RDD]; all input dirs are listed through AmazonS3Client.]
Bulk listing input dirs in parallel via AmazonS3Client.
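The difference between sequential and parallel listing can be sketched as follows. This is a minimal illustration, not the actual patch: an in-memory dictionary stands in for the S3 bucket, `list_prefix` stands in for one bulk list request, and all keys and names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an S3 bucket: key -> size in bytes.
FAKE_BUCKET = {
    "logs/dateint=20150801/part-0": 10,
    "logs/dateint=20150801/part-1": 20,
    "logs/dateint=20150802/part-0": 30,
    "logs/dateint=20150803/part-0": 40,
}


def list_prefix(prefix):
    """Stand-in for one bulk list request against the object store."""
    return sorted(k for k in FAKE_BUCKET if k.startswith(prefix))


def list_sequential(prefixes):
    # One partition path at a time, as in the sequential case.
    return {p: list_prefix(p) for p in prefixes}


def list_parallel(prefixes, max_workers=8):
    # Issue the list requests concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(prefixes, pool.map(list_prefix, prefixes)))


partitions = [f"logs/dateint={d}/" for d in (20150801, 20150802, 20150803)]
listing = list_parallel(partitions)
```

Both functions return the same mapping; with a real object store the parallel version overlaps the network round trips, which is where the time goes when a table has many partitions.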
35. Problem 1: Hadoop output committer
• How it works:
  • Each task writes output to a temp dir
  • Output committer renames first successful task's temp dir to final destination
• Problems with S3:
  • S3 rename is copy and delete
  • S3 is eventually consistent
  • FileNotFoundException during rename
36. S3 output committer
• How it works:
  • Each task writes output to local disk
  • Output committer copies first successful task's output to S3
• Advantages:
  • Avoid redundant S3 copy
  • Avoid eventual consistency
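The commit flow described above can be sketched with local directories only. This is a toy model under stated assumptions: a temp directory stands in for the S3 destination, `shutil.copyfile` stands in for the upload, and the function names and `part-NNNNN` file naming are hypothetical rather than the real committer's.

```python
import os
import shutil
import tempfile


def run_task(local_root, task_id, attempt, lines, fail=False):
    """Write one task attempt's output to local disk.

    Returns the file path, or None to simulate a failed attempt.
    """
    path = os.path.join(local_root, f"task{task_id}_attempt{attempt}")
    with open(path, "w") as f:
        f.write("\n".join(lines))
    return None if fail else path


def commit(attempt_paths_by_task, dest_dir):
    """Copy the first successful attempt of each task to the destination."""
    os.makedirs(dest_dir, exist_ok=True)
    committed = []
    for task_id, attempts in sorted(attempt_paths_by_task.items()):
        first_ok = next((p for p in attempts if p is not None), None)
        if first_ok is None:
            raise RuntimeError(f"task {task_id} has no successful attempt")
        target = os.path.join(dest_dir, f"part-{task_id:05d}")
        shutil.copyfile(first_ok, target)  # one copy, no rename step
        committed.append(target)
    return committed


local = tempfile.mkdtemp(prefix="local-")
dest = tempfile.mkdtemp(prefix="dest-")  # stands in for the S3 destination
attempts = {
    0: [run_task(local, 0, 0, ["a"], fail=True), run_task(local, 0, 1, ["a"])],
    1: [run_task(local, 1, 0, ["b"])],
}
files = commit(attempts, dest)
```

Because each output file reaches the destination through a single copy, there is no rename of a just-written object and therefore no window in which a listing can miss it.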
37. Problem 2: Hive insert overwrite
• How it works:
  • Delete and rewrite existing output in partitions
• Problems with S3:
  • S3 is eventually consistent
  • FileAlreadyExistsException during rewrite
38. Batchid pattern
• How it works:
  • Never delete existing output in partitions
  • Each job inserts a unique subpartition called batchid
• Advantages:
  • Avoid eventual consistency
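A minimal sketch of the batchid idea follows. The batchid format, the path layout, the `s3://bucket/tbl` root, and the toy `Catalog` class are all assumptions made for illustration; the slides only state that each job writes to a unique subpartition and never deletes existing output. How readers locate the current batchid is modeled here as a catalog pointer, which is also an assumption.

```python
import time
import uuid


def new_batchid():
    """Hypothetical batchid: millisecond timestamp plus a random suffix."""
    return f"{int(time.time() * 1000)}-{uuid.uuid4().hex[:8]}"


def output_path(table_root, partition, batchid):
    # Layout assumed for illustration: <root>/<partition>/batchid=<id>
    return f"{table_root}/{partition}/batchid={batchid}"


class Catalog:
    """Toy stand-in for the metastore: partition -> current output path."""

    def __init__(self):
        self.current = {}

    def publish(self, partition, path):
        # Point readers at the new subpartition; old files are left untouched.
        self.current[partition] = path


catalog = Catalog()
first = output_path("s3://bucket/tbl", "dateint=20150801", new_batchid())
catalog.publish("dateint=20150801", first)

# A rerun of the same job writes to a fresh location instead of overwriting.
second = output_path("s3://bucket/tbl", "dateint=20150801", new_batchid())
catalog.publish("dateint=20150801", second)
```

Since no existing key is ever deleted or rewritten, a rerun cannot collide with objects from an earlier run that are still visible in a stale listing.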
42. On-demand notebooks
• Zero installation
• Dependency management via Docker
• Notebook persistence
• Elastic resources
43. Quick facts about Titan
• Task execution platform leveraging Apache Mesos.
• Manages underlying EC2 instances.
• Process supervision and uptime in the face of failures.
• Auto scaling