The pivot operation is very computation intensive. When the resulting table contains more than 10k columns, it becomes especially demanding. In this talk, we will look at why the current Spark implementation is inefficient in these cases, the problems faced during development, and finally how to implement this operation in a performant way.
Author: Sadik Bakiu (Data Reply)
3. INITIAL SET UP
[Diagram: vehicle features such as Fuel Efficiency, Service Date, Service Calls, Production Date and Avg. Speed feeding the BI Department]
• Oracle backend
• 11,000+ features
• Sparsely populated tables
• 1000 cols/table
• 12 tables
• Feature engineering
• Input in row format
4. PIVOT
Input:
Article | Month | Sales
1       | June  | 1000
1       | July  | 2000
1       | June  | 1000
2       | June  | 3000
2       | June  | 4000

Result:
Article | June | July
1       | 2000 | 2000
2       | 7000 |

• Article: Group Column
• Month: Pivot Column
• Sales: Value
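The pivot above can be mimicked with plain Scala collections, which makes the group/pivot/aggregate semantics concrete without a Spark cluster (a hypothetical sketch; the Row case class and sample data are made up to match the tables above):

```scala
// Sample rows matching the input table above (illustrative only)
case class Row(article: Int, month: String, sales: Int)

val input = Seq(
  Row(1, "June", 1000), Row(1, "July", 2000), Row(1, "June", 1000),
  Row(2, "June", 3000), Row(2, "June", 4000)
)

// Group by (group column, pivot column), sum the value column,
// then regroup by the group column so each article carries one
// entry per pivoted month.
val pivoted: Map[Int, Map[String, Int]] =
  input
    .groupBy(r => (r.article, r.month))
    .map { case ((article, month), rows) => (article, month, rows.map(_.sales).sum) }
    .groupBy(_._1)
    .map { case (article, cells) =>
      article -> cells.map { case (_, month, total) => month -> total }.toMap
    }
```

Here `pivoted(1)` holds both months, while `pivoted(2)` only has a June entry, matching the empty July cell in the result table.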
5. PIVOT IN SPARK (1)

3-step implementation (pivot values: [June, July]):
1. Group by the grouping and pivot columns and aggregate
2. Group by the grouping column, collecting the aggregates into an array
3. Project the array values into columns

Input:
Article | Month | Sales
1       | June  | 1000
1       | July  | 2000
1       | June  | 1000
2       | June  | 3000
2       | June  | 4000

After step 1:
Article | Month | Sales
1       | June  | 2000
1       | July  | 2000
2       | June  | 7000

After step 2:
Article | Array
1       | [2000, 2000]
2       | [7000, null]

After step 3:
Article | June | July
1       | 2000 | 2000
2       | 7000 |
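The three steps can be sketched with plain Scala collections as a hedged stand-in for Spark's own implementation (the names step1/step2/step3 and the sample data are illustrative only). Note how the fixed pivot-value order [June, July] produces the null slot (here `None`) for article 2:

```scala
// Sample rows matching the input table (illustrative only)
case class Sale(article: Int, month: String, sales: Int)

val rows = Seq(
  Sale(1, "June", 1000), Sale(1, "July", 2000), Sale(1, "June", 1000),
  Sale(2, "June", 3000), Sale(2, "June", 4000)
)
val pivotValues = Seq("June", "July")

// Step 1: group by (grouping column, pivot column) and aggregate.
val step1: Map[(Int, String), Int] =
  rows.groupBy(r => (r.article, r.month))
      .map { case (key, rs) => key -> rs.map(_.sales).sum }

// Step 2: group by the grouping column alone, collecting the
// aggregates into an array ordered by the pivot values;
// a missing pivot value yields None (null in Spark).
val step2: Map[Int, Seq[Option[Int]]] =
  step1.keys.map(_._1).toSeq.distinct.map { article =>
    article -> pivotValues.map(m => step1.get((article, m)))
  }.toMap

// Step 3: project each array slot into its own column.
val step3: Seq[(Int, Option[Int], Option[Int])] =
  step2.toSeq.sortBy(_._1).map { case (a, Seq(june, july)) => (a, june, july) }
```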
6. PIVOT IN SPARK (2): OUR CASE

Pivot values: [Category, Color, Type]

Input:
Car Id | Feat Name | Feat Value
1      | Category  | Truck
1      | Color     | Blue
1      | Type      | Box
2      | Category  | Bus
2      | Color     | Black

After step 1 the table is unchanged: every (Car Id, Feat Name) pair occurs only once, so there is nothing to aggregate.

After step 2:
Car Id | Array
1      | [Truck, Blue, Box]
2      | [Bus, Black, null]

After step 3:
Car Id | Category | Color | Type
1      | Truck    | Blue  | Box
2      | Bus      | Black |
7. PIVOT
3-step implementation:
1. Add a column, mapCol, as Map(Feature Name -> Feature Value)
2. Group by the grouping column, collect_list(mapCol)
3. Project the map values into columns
8. PIVOT
Input:
Car Id | Feat Name | Feat Value
1      | Category  | Truck
1      | Color     | Blue
1      | Type      | Box
2      | Category  | Bus
2      | Color     | Black

After step 1 (mapCol added):
Car Id | mapCol
1      | Category -> Truck
1      | Color -> Blue
1      | Type -> Box
2      | Category -> Bus
2      | Color -> Black
9. PIVOT
After step 2:
Car Id | maps
1      | [(Category -> Truck), (Color -> Blue), (Type -> Box)]
2      | [(Category -> Bus), (Color -> Black)]

After step 3:
Car Id | Category | Color | Type
1      | Truck    | Blue  | Box
2      | Bus      | Black |
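The map-based approach walked through above can be sketched with plain Scala collections standing in for Spark's map(), collect_list() and the final projection (names and sample data are illustrative; Spark would keep a list of maps per car, merged into one map here for brevity):

```scala
// Sample feature rows matching the input table (illustrative only)
case class Feat(carId: Int, name: String, value: String)

val feats = Seq(
  Feat(1, "Category", "Truck"), Feat(1, "Color", "Blue"), Feat(1, "Type", "Box"),
  Feat(2, "Category", "Bus"),   Feat(2, "Color", "Black")
)
val pivotValues = Seq("Category", "Color", "Type")

// Step 1: one single-entry map per row (Feature Name -> Feature Value).
val withMapCol: Seq[(Int, Map[String, String])] =
  feats.map(f => f.carId -> Map(f.name -> f.value))

// Step 2: group by the grouping column and collect the maps,
// then merge them into one map per car.
val collected: Map[Int, Map[String, String]] =
  withMapCol.groupBy(_._1).map { case (id, ms) =>
    id -> ms.map(_._2).reduce(_ ++ _)
  }

// Step 3: project each known feature name to its own column;
// an absent feature becomes None (null in Spark).
val projected: Seq[(Int, Seq[Option[String]])] =
  collected.toSeq.sortBy(_._1).map { case (id, m) =>
    id -> pivotValues.map(m.get)
  }
```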
10. TABLE STRUCTURE
Destination table:
1. Only one table
2. Hive table on HDFS
3. Theoretically an unlimited number of columns
   1. In practice, only if you populate at most 2000 of them
4. Metadata kept separate from the actual data
12. TIPS AND TRICKS
• In Spark, get the distinct values of the pivot column up front (this mirrors what Spark itself does when no pivot values are passed; maxValues corresponds to the spark.sql.pivotMaxValues setting):

val values = df.select(pivotColumn)
  .distinct()
  .limit(maxValues + 1)
  .sort(pivotColumn)
  .collect()
  .map(_.get(0))
  .toSeq
• Provide the pivot values explicitly to improve performance, skipping that extra distinct-and-collect job:

df.groupBy(groupCol).pivot(pivotCol, Seq(val1, val2)).agg(...)