This document discusses developing Apache Pig on Apache Tez. It provides background on Pig and Tez, describes how Pig was rewritten to run on Tez instead of MapReduce, and outlines the benefits of this including improved performance and resource utilization. It also discusses the collaboration between multiple companies to undertake this work as an open source project over 6 months, setting a goal of making Pig run 2x faster, and how knowledge and credit were shared across the participating organizations.
Developing Pig on Tez (ApacheCon 2014)
1. Developing Pig on Tez
Mark Wagner
Committer, Apache Pig
LinkedIn
Cheolsoo Park
VP, Apache Pig
Netflix
2. What is Pig
● Apache project since 2008
● Higher-level dataflow language for Hadoop with a MapReduce-based execution engine
A = LOAD 'input.txt';
B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0))
AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO './output.txt';
6. Pig Latin
A = LOAD 'input.txt';
B = FOREACH A GENERATE
flatten(TOKENIZE((chararray)$0))
AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO './output.txt';
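As a rough illustration of what the word-count script above computes, here is a minimal Python sketch. It is not how Pig executes the script. The function name `word_count` is hypothetical, and `str.split()` (whitespace only) is an assumed stand-in for Pig's TOKENIZE, which also splits on some other delimiters.

```python
from collections import Counter


def word_count(lines):
    """Split each line into words, group by word, and count each group."""
    counts = Counter()
    for line in lines:
        # Roughly FOREACH ... flatten(TOKENIZE(...)): one record per word.
        for word in line.split():
            # Roughly GROUP B BY word followed by COUNT(B).
            counts[word] += 1
    return dict(counts)


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```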
10. What's the problem
● Extra intermediate output
● Artificial synchronization barriers
● Inefficient use of resources
● Multiquery Optimizer
  ● Alleviates some problems
  ● Has its own problems
12. Tez Concepts
● Job expressed as directed acyclic graph (DAG)
● Processing done at vertices
● Data flows along edges
[Diagram: Mapper and Reducer stages shown alongside a graph of four Processor vertices]
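The three concepts on this slide (a job as a DAG, processing at vertices, data flowing along edges) can be sketched in a few lines of Python. This is a toy model, not the Tez API. `run_dag` and its parameters are hypothetical names, and the standard-library `graphlib` module (Python 3.9+) provides the topological ordering.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, sources):
    """Run a toy DAG job.

    vertices: name -> function taking a list of inputs, returning an output
    edges:    list of (upstream, downstream) pairs; data flows along them
    sources:  name -> initial input for vertices that have no upstream
    """
    upstream = {name: [] for name in vertices}
    for src, dst in edges:
        upstream[dst].append(src)

    results = {}
    # static_order() yields each vertex only after all of its upstream vertices.
    for name in TopologicalSorter(upstream).static_order():
        inputs = [results[u] for u in upstream[name]] or [sources[name]]
        results[name] = vertices[name](inputs)
    return results


if __name__ == "__main__":
    vertices = {
        "load": lambda ins: ins[0],
        "tokenize": lambda ins: [w for line in ins[0] for w in line.split()],
        "count": lambda ins: len(ins[0]),
    }
    edges = [("load", "tokenize"), ("tokenize", "count")]
    print(run_dag(vertices, edges, {"load": ["a b", "c"]})["count"])
```

Unlike a fixed map-then-reduce pair, the graph can have any number of vertices, which is what removes the need to chain separate jobs.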
13. Benefits & Optimizations
● Fewer synchronization barriers
● Container reuse
● Object caches at the vertices
● Dynamic parallelism estimation
● Custom data transfer between processors
14. What we've done for Pig
● New execution engine based on Tez
● Physical Plan translated to Tez Plan instead of MapReduce Plan
● Same Physical Plan and operators
● Custom processors run the execution plan on Tez
15. Along the way
● New pluggable execution backend
● Made operator set more generic
● Motivated Tez improvements
16. Group By
[Diagram: MapReduce and Tez plans for the script below; node labels: LOAD; GROUP BY, SUM; Identity; GROUP BY; HDFS; LOAD; GROUP BY, STORE; GROUP BY, SUM]
f = LOAD 'foo'
AS (x:int, y:int);
g = GROUP f BY x;
h = FOREACH g GENERATE
group AS r,
SUM(f.y) AS s;
i = GROUP h BY s;
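The cost difference for chained GROUP BYs can be approximated with a small counting model. This sketch rests on a simplifying assumption that is not stated on the slide: each GROUP BY needs one shuffle, MapReduce allows one shuffle per job, and Tez can chain shuffles inside a single DAG. The function names are hypothetical.

```python
def mapreduce_plan(num_group_bys):
    """Assumed model: one MapReduce job per GROUP BY in the chain."""
    if num_group_bys < 1:
        raise ValueError("need at least one GROUP BY")
    jobs = num_group_bys
    return {
        "jobs": jobs,
        # Every job after the first starts with a map stage that only re-reads data.
        "identity_map_stages": jobs - 1,
        # Output between consecutive jobs is written to HDFS.
        "intermediate_hdfs_writes": jobs - 1,
    }


def tez_plan(num_group_bys):
    """Assumed model: one DAG with a LOAD vertex plus one vertex per GROUP BY."""
    if num_group_bys < 1:
        raise ValueError("need at least one GROUP BY")
    return {
        "dags": 1,
        "vertices": num_group_bys + 1,
        "intermediate_hdfs_writes": 0,
    }


if __name__ == "__main__":
    # The script above has two GROUP BYs.
    print(mapreduce_plan(2))
    print(tez_plan(2))
```

For the two GROUP BYs in the script above, the model gives two MapReduce jobs with one identity map stage and one intermediate HDFS write, against a single three-vertex Tez DAG.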
17. Join
[Diagram: MapReduce and Tez plans for the script below; node labels: LOAD l, r; JOIN, STORE; LOAD r; JOIN, STORE; LOAD l]
l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x;
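For readers less familiar with Pig Latin, the result of the JOIN above can be sketched as a plain inner join in Python. This shows only the semantics (each output record carries the fields of both inputs), not how Pig or Tez executes the join. `inner_join` is a hypothetical helper.

```python
from collections import defaultdict


def inner_join(left, right):
    """Inner join of (x, y) tuples with (x, z) tuples on their first field."""
    by_key = defaultdict(list)
    for record in right:
        by_key[record[0]].append(record)
    # Output records concatenate the left and right fields: (x, y, x, z).
    return [l + r for l in left for r in by_key.get(l[0], [])]


if __name__ == "__main__":
    left = [(1, "a"), (2, "b")]
    right = [(1, "p"), (1, "q"), (3, "r")]
    print(inner_join(left, right))
```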
18. Group By
[Diagram: MapReduce and Tez plans for the script below; node labels: LOAD; GROUP f BY x, GROUP f BY y; LOAD g, h; JOIN; HDFS; LOAD; JOIN; GROUP BY; GROUP BY]
f = LOAD 'foo'
AS (x:int, y:int);
g = GROUP f BY x;
h = GROUP f BY y;
i = JOIN g BY group,
h BY group;
20. Performance Comparison
[Bar chart: Map Reduce vs. Tez for four queries; vertical axis from 0 to 5000, units not given]
● Replicated Join (2.8x)
● Join + Group By (1.5x)
● Join + Group By + Order By (1.5x)
● 3 way Split + Join + Group By (2.6x)
21. How it started
Shared interests across organizations
● Similar data platform architecture.
● Pig for ETL jobs
● Hive for ad-hoc queries
22. How it started
Shared interests across organizations
● Hortonworks wants Tez to succeed.
23. Organizing team
Community meet-ups helped
● Twitter presented a summer intern's POC work at a Tez meet-up.
● Pig devs exchanged interests.
24. Organizing team
Community meet-ups helped
● The Tez team hosted tutorial sessions for Pig devs.
● The Pig team got together to brainstorm the implementation design.
25. Building trust
Companies showed commitment to the project
● Hortonworks: Daniel Dai
● LinkedIn: Alex Bain, Mark Wagner
● Netflix: Cheolsoo Park
● Yahoo: Olga Natkovich, Rohini Palaniswamy
26. Setting goals
Make Pig 2x faster within 6 months
● Hive-on-Tez showed 2x performance gain.
● Rewriting the Pig backend within 6 months seemed reasonable.
27. Acting as team
Sprint
● Monthly planning meetings
● Twice-a-week stand-up conference calls
Issues / discussions
● PIG-3446: umbrella JIRA for Pig on Tez
● Whiteboard discussions at meetings
28. Knowledge transfer
● Pig old-timer Daniel Dai acted as mentor.
● Everyone got to work on core functionality.
● Everyone became an expert on the Pig backend.
29. Sharing credit
● Elected as a new committer and PMC chair.
● Gave talks at Hadoop User Group and Pig User Group meet-ups.
● Speaking at ApacheCon and upcoming Hadoop Summit.
30. Further collaborations
Looking for more collaborations
● Parquet Hive SerDe improvements.
● Sharing experiences with SQL-on-Hadoop solutions.
31. Mind shift
"If we can't hire all these good people, why don't we use them in a collaboration?"
● Collaboration instead of competition.
32. Mind shift
"Why do we reinvent the wheel?"
● Share the same technologies while creating different services.