The pivot operation is very computation intensive. When the resulting table contains more than 10k columns, it becomes especially demanding. In this talk, we will look at why the current Spark implementation is inefficient in these cases, the problems faced during development, and finally how to implement this operation in a performant way.
Author: Sadik Bakiu (Data Reply)
3. INITIAL SET UP
[Diagram: vehicle features such as Fuel Efficiency, Service Date, Service Calls, Production Date and Avg. Speed feeding the BI Department]
• Oracle backend
• 11,000+ features
• Sparsely populated tables
• 1000 cols/table
• 12 tables
• Feature engineering
• Input in row format
4. PIVOT
Input:
Article | Month | Sales
1       | June  | 1000
1       | July  | 2000
1       | June  | 1000
2       | June  | 3000
2       | June  | 4000

Result:
Article | June | July
1       | 2000 | 2000
2       | 7000 |

• Article: Group Column
• Month: Pivot Column
• Sales: Value
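The pivot above can be mimicked with plain Scala collections, which makes the group/pivot/aggregate semantics concrete without a Spark cluster (a hypothetical sketch; the Row case class and sample data are made up to match the tables above):

```scala
// Sample rows matching the input table above (illustrative only)
case class Row(article: Int, month: String, sales: Int)

val input = Seq(
  Row(1, "June", 1000), Row(1, "July", 2000), Row(1, "June", 1000),
  Row(2, "June", 3000), Row(2, "June", 4000)
)

// Group by (group column, pivot column), sum the value column,
// then regroup by the group column so each article carries one
// entry per pivoted month.
val pivoted: Map[Int, Map[String, Int]] =
  input
    .groupBy(r => (r.article, r.month))
    .map { case ((article, month), rows) => (article, month, rows.map(_.sales).sum) }
    .groupBy(_._1)
    .map { case (article, cells) =>
      article -> cells.map { case (_, month, total) => month -> total }.toMap
    }
```

Here `pivoted(1)` holds both months, while `pivoted(2)` only has a June entry, matching the empty July cell in the result table.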
5. PIVOT IN SPARK (1)

3-step implementation (pivot values: [June, July]):
1. Group by the grouping and pivot columns and aggregate
2. Group by the grouping column, collecting the aggregates into an array
3. Project the array values into columns

Input:
Article | Month | Sales
1       | June  | 1000
1       | July  | 2000
1       | June  | 1000
2       | June  | 3000
2       | June  | 4000

After step 1:
Article | Month | Sales
1       | June  | 2000
1       | July  | 2000
2       | June  | 7000

After step 2:
Article | Array
1       | [2000, 2000]
2       | [7000, null]

After step 3:
Article | June | July
1       | 2000 | 2000
2       | 7000 |
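The three steps can be sketched with plain Scala collections as a hedged stand-in for Spark's own implementation (the names step1/step2/step3 and the sample data are illustrative only). Note how the fixed pivot-value order [June, July] produces the null slot (here `None`) for article 2:

```scala
// Sample rows matching the input table (illustrative only)
case class Sale(article: Int, month: String, sales: Int)

val rows = Seq(
  Sale(1, "June", 1000), Sale(1, "July", 2000), Sale(1, "June", 1000),
  Sale(2, "June", 3000), Sale(2, "June", 4000)
)
val pivotValues = Seq("June", "July")

// Step 1: group by (grouping column, pivot column) and aggregate.
val step1: Map[(Int, String), Int] =
  rows.groupBy(r => (r.article, r.month))
      .map { case (key, rs) => key -> rs.map(_.sales).sum }

// Step 2: group by the grouping column alone, collecting the
// aggregates into an array ordered by the pivot values;
// a missing pivot value yields None (null in Spark).
val step2: Map[Int, Seq[Option[Int]]] =
  step1.keys.map(_._1).toSeq.distinct.map { article =>
    article -> pivotValues.map(m => step1.get((article, m)))
  }.toMap

// Step 3: project each array slot into its own column.
val step3: Seq[(Int, Option[Int], Option[Int])] =
  step2.toSeq.sortBy(_._1).map { case (a, Seq(june, july)) => (a, june, july) }
```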
6. PIVOT IN SPARK (2): OUR CASE

Pivot values: [Category, Color, Type]

Input:
Car Id | Feat Name | Feat Value
1      | Category  | Truck
1      | Color     | Blue
1      | Type      | Box
2      | Category  | Bus
2      | Color     | Black

After step 1 the table is unchanged: every (Car Id, Feat Name) pair occurs only once, so there is nothing to aggregate.

After step 2:
Car Id | Array
1      | [Truck, Blue, Box]
2      | [Bus, Black, null]

After step 3:
Car Id | Category | Color | Type
1      | Truck    | Blue  | Box
2      | Bus      | Black |
7. PIVOT
3-step implementation:
1. Add a column, mapCol, as Map(Feature Name -> Feature Value)
2. Group by the grouping column, collect_list(mapCol)
3. Project the map values into columns
8. PIVOT
Input:
Car Id | Feat Name | Feat Value
1      | Category  | Truck
1      | Color     | Blue
1      | Type      | Box
2      | Category  | Bus
2      | Color     | Black

After step 1 (mapCol added):
Car Id | mapCol
1      | Category -> Truck
1      | Color -> Blue
1      | Type -> Box
2      | Category -> Bus
2      | Color -> Black
9. PIVOT
After step 2:
Car Id | maps
1      | [(Category -> Truck), (Color -> Blue), (Type -> Box)]
2      | [(Category -> Bus), (Color -> Black)]

After step 3:
Car Id | Category | Color | Type
1      | Truck    | Blue  | Box
2      | Bus      | Black |
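The map-based approach walked through above can be sketched with plain Scala collections standing in for Spark's map(), collect_list() and the final projection (names and sample data are illustrative; Spark would keep a list of maps per car, merged into one map here for brevity):

```scala
// Sample feature rows matching the input table (illustrative only)
case class Feat(carId: Int, name: String, value: String)

val feats = Seq(
  Feat(1, "Category", "Truck"), Feat(1, "Color", "Blue"), Feat(1, "Type", "Box"),
  Feat(2, "Category", "Bus"),   Feat(2, "Color", "Black")
)
val pivotValues = Seq("Category", "Color", "Type")

// Step 1: one single-entry map per row (Feature Name -> Feature Value).
val withMapCol: Seq[(Int, Map[String, String])] =
  feats.map(f => f.carId -> Map(f.name -> f.value))

// Step 2: group by the grouping column and collect the maps,
// then merge them into one map per car.
val collected: Map[Int, Map[String, String]] =
  withMapCol.groupBy(_._1).map { case (id, ms) =>
    id -> ms.map(_._2).reduce(_ ++ _)
  }

// Step 3: project each known feature name to its own column;
// an absent feature becomes None (null in Spark).
val projected: Seq[(Int, Seq[Option[String]])] =
  collected.toSeq.sortBy(_._1).map { case (id, m) =>
    id -> pivotValues.map(m.get)
  }
```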
10. TABLE STRUCTURE
Destination table:
1. Only one table
2. Hive table on HDFS
3. Theoretically an unlimited number of columns
   1. In practice, only if you populate at most 2000 of them
4. Metadata kept separate from the actual data
12. TIPS AND TRICKS
• In Spark, get the distinct values of the pivot column up front (this mirrors what Spark itself does when no pivot values are passed; maxValues corresponds to the spark.sql.pivotMaxValues setting):

val values = df.select(pivotColumn)
  .distinct()
  .limit(maxValues + 1)
  .sort(pivotColumn)
  .collect()
  .map(_.get(0))
  .toSeq
• Provide the pivot values explicitly to improve performance, skipping that extra distinct-and-collect job:

df.groupBy(groupCol).pivot(pivotCol, Seq(val1, val2)).agg(...)