Hive at LinkedIn
©2013 LinkedIn Corporation. All Rights Reserved.
Agenda
• LinkedIn Data and its Ecosystem
• Performance Improvements – Avro
• User experiences
LinkedIn Data Sources
• Event Data
– Page Views
– Clicks
– Search queries
• Database Data
– Profile (Users & Companies)
– Connections
• External Data
– Salesforce, DoubleClick
Where's the Data at LinkedIn?
[Diagram labels: Member Data (Profiles); Espresso and RDBMS; External Partner Data; Member Activity (page views, button clicks); Kafka Topics; Front-end Serving Systems; Member-facing systems. Note on slide: "Lots of cool stuff not in this picture!"]
© 2013 LinkedIn, 24 June 2013
Data Ecosystem at LinkedIn
[Diagram label: Member Facing Systems]
Data in Hadoop
• Almost all LinkedIn data is stored in Hadoop
• Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban
Hive Usage
• Use-cases
– Ad-hoc query
– Reporting
– Building Platforms
    • Segmentation Engine
    • Experimentations Engine
• Users
– Data Scientist
– Business Analytics
– Security team
– Product team
Hive Challenges
• Performance
– Faster query execution
• Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability
LinkedIn Hive Initiatives
• Make HCatalog work and deploy it [Ongoing]
• Hive Performance Improvement (Avro data reading) [Ongoing]
• Stabilize Hive Server 2 at LI [About to Start]
• Expand the scope of HCatalog metadata [Planning]
HCatalog Initiatives
• Expand scope of meta-data
– Who creates this data?
– What are the inputs?
    • Helpful to create data lineage
– Who is the maintainer of data?
What is the Problem?
• Reading an Avro record takes a long time.
– 52 μs/record
• Found the hotspot using VisualVM
Improvement #1
• Reduce the number of Schema.equals() calls
• Schema equality checks are required primarily for evolved schemas.
• The solution includes caching to avoid unnecessary expensive calls
• Results
– Trunk read overhead: 52 μs/record
– Read overhead after this patch: 32 μs/record
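The caching idea on this slide can be illustrated with a small Python sketch. It is not the actual Hive or Avro patch: the `Schema` and `SchemaEqualityCache` classes, the `deep_equals` method, and the field names are all hypothetical stand-ins, and the sketch assumes schema objects are not mutated after they are created.

```python
class Schema:
    """Hypothetical stand-in for a schema whose equality check is expensive."""

    equals_calls = 0  # counts deep comparisons, for illustration only

    def __init__(self, fields):
        self.fields = tuple(fields)

    def deep_equals(self, other):
        # Stands in for an expensive structural comparison.
        Schema.equals_calls += 1
        return self.fields == other.fields


class SchemaEqualityCache:
    """Remembers the outcome of comparing a given pair of schema objects."""

    def __init__(self):
        self._results = {}

    def equals(self, a, b):
        if a is b:
            return True
        key = (id(a), id(b))
        if key not in self._results:
            # Store the schemas alongside the result so their ids stay valid
            # for as long as the cache entry exists.
            self._results[key] = (a.deep_equals(b), a, b)
        return self._results[key][0]


# Assumed example: the same reader/writer pair is compared once per record.
writer = Schema([("memberId", "long"), ("pageKey", "string")])
reader = Schema([("memberId", "long"), ("pageKey", "string")])
cache = SchemaEqualityCache()
results = [cache.equals(writer, reader) for _ in range(1000)]
print(all(results), Schema.equals_calls)
```

In this sketch the expensive comparison runs once for the pair rather than once per record, which is the kind of saving the slide describes.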
Improvement #2
• Reduce extra data transformations
• The solution is to provide custom object inspectors
• Results
– Current read overhead: 52 μs/record
– Read overhead after this patch: 30 μs/record
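The slides do not say which transformations were removed, so the following Python sketch only illustrates the general object-inspector idea: the engine asks an inspector for a field of the record in its native form instead of first copying the whole record into another structure. `EagerReader`, `RecordInspector`, and the sample record are hypothetical names made up for this illustration and are not Hive APIs.

```python
class EagerReader:
    """Converts every native record into a new list before it is used."""

    def __init__(self, field_names):
        self.field_names = field_names

    def read(self, native_record):
        # Extra transformation: copies all fields, including unused ones.
        return [native_record[name] for name in self.field_names]


class RecordInspector:
    """Hypothetical inspector that reads fields from the native record."""

    def __init__(self, field_names):
        self.field_names = field_names

    def get_field(self, native_record, name):
        # No intermediate copy: the field is looked up in place.
        return native_record[name]


# Assumed sample record; the field names are illustrative only.
record = {"memberId": 42, "pageKey": "profile", "referrer": "search"}
fields = ["memberId", "pageKey", "referrer"]

copied = EagerReader(fields).read(record)
page_key_via_copy = copied[fields.index("pageKey")]
page_key_direct = RecordInspector(fields).get_field(record, "pageKey")
print(page_key_via_copy, page_key_direct)
```

Both paths return the same value; the inspector path simply skips building the intermediate list for each record.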
Final Results
[Bar chart, axis 0 to 60 (units not shown in the extracted text, presumably μs/record): Trunk 55, Improvement #1 32, Improvement #2 30, Combined 11]
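As a worked example, the relative gains can be computed from the four chart values (55, 32, 30, 11). This sketch uses the chart's trunk value of 55, although the earlier slides quote 52 μs/record for trunk. The variable names are illustrative.

```python
# Values as labeled on the Final Results chart.
overhead = {
    "Trunk": 55,
    "Improvement #1": 32,
    "Improvement #2": 30,
    "Combined": 11,
}

baseline = overhead["Trunk"]
summary = {}
for name, value in overhead.items():
    reduction_pct = round(100 * (baseline - value) / baseline, 1)
    speedup = round(baseline / value, 2)
    summary[name] = (reduction_pct, speedup)
    print(f"{name}: {reduction_pct}% lower overhead, {speedup}x faster")
```

On these numbers, the combined patches cut the per-record read overhead by 80%, a 5x speedup over trunk.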
Hive User Base at LinkedIn
Out of all our Hadoop users:
• 56% never used Hive
• 44% use Hive
• 27% primarily use Hive
32% of Hive jobs were from ad-hoc queries
Who uses Hive and who doesn’t
[Bar chart: Hive adoption among Hadoop users by job title (Data Scientists, Engineers, Product Managers, Customer Support Specialists, Analysts); axis from 0% to 100%; bar values not recoverable]
Top concerns about Hive
• Not friendly for long/complex workflows
• Performance, especially for ad-hoc queries
• Steep learning curve for tuning
• Data/UDFs unavailability

Editor's Notes

  • #11: Hive: ad-hoc queries and reporting, business analytics. Pig: ETL pipelines, production workflows. MR: highly specialized applications. Azkaban: LinkedIn workflows.
  • #15: Which process. Data operations can detect the root cause. Email, HTTP address.
  • #17: Context of the problem