©2013 LinkedIn Corporation. All Rights Reserved.
Hive at LinkedIn

Agenda
 LinkedIn Data and its Ecosystem
 Performance Improvements – Avro
 User experiences

LinkedIn Data Sources
 Event Data
– Page Views
– Clicks
– Search queries
 Database Data
– Profile (Users & Companies)
– Connections
 External Data
– Salesforce, DoubleClick

Where's the Data at LinkedIn?
 Member Data (Profiles): Espresso and RDBMS
 External Partner Data
 Member Activity (Page views, button clicks): Kafka Topics
 Front-end Serving Systems
 Member-facing systems
 Lots of cool stuff not in this picture!
© 2013 LinkedIn 24 June 2013
Data Ecosystem at LinkedIn
Member-Facing Systems (diagram)

Data in Hadoop
 Almost all LinkedIn data is stored in Hadoop
 Tools used
– Hive/HCatalog
– Pig
– Java MapReduce
– Azkaban

Hive Usage
 Use cases
– Ad-hoc queries
– Reporting
– Building platforms
 Segmentation Engine
 Experimentation Engine
 Users
– Data Scientists
– Business Analytics
– Security team
– Product team

Hive Challenges
 Performance
– Faster query execution
 Efficient MR* execution plan
– Effective resource usage
– Ensure cluster stability

LinkedIn Hive Initiatives
 Make HCatalog work and deploy it [Ongoing]
 Hive Performance Improvement (Avro data reading) [Ongoing]
 Stabilize Hive Server 2 at LinkedIn [About to Start]
 Expand the scope of HCatalog metadata [Planning]

HCatalog Initiatives
 Expand scope of metadata
– Who creates this data?
– What are the inputs?
 Helpful for building data lineage
– Who maintains the data?
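
The three questions above amount to a small metadata record per dataset. A minimal sketch in Python, assuming a hypothetical `DatasetMeta` record and invented dataset, job and team names (none of this is HCatalog's actual schema or API), shows how recording the inputs is enough to derive lineage:

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; not HCatalog's actual schema."""
    name: str
    creator: str       # who creates this data?
    maintainer: str    # who maintains the data?
    inputs: list = field(default_factory=list)  # what are the inputs?


def upstream(catalog, name):
    """Return every dataset that `name` depends on, directly or indirectly."""
    seen = []
    stack = list(catalog[name].inputs)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        # Datasets missing from the catalog are treated as raw sources.
        if current in catalog:
            stack.extend(catalog[current].inputs)
    return sorted(seen)


# Illustrative entries only; all names below are invented.
catalog = {
    "page_views": DatasetMeta("page_views", "tracking-job", "data-infra"),
    "profiles": DatasetMeta("profiles", "db-snapshot-job", "data-infra"),
    "segments": DatasetMeta("segments", "segmentation-job", "analytics",
                            inputs=["page_views", "profiles"]),
    "report": DatasetMeta("report", "reporting-job", "analytics",
                          inputs=["segments"]),
}

lineage = upstream(catalog, "report")
```

Here `lineage` lists everything the hypothetical `report` dataset is built from, including the indirect inputs reached through `segments`.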

What is the Problem?
 Reading an Avro record takes a long time.
– 52 μs/record
 Found the hotspot using VisualVM

Improvement #1
 Reduce the number of Schema.equals() calls
 Schema equality checks are required primarily for evolved schemas.
 Solution includes caching to avoid unnecessary, expensive calls
 Results
– Trunk read overhead: 52 μs/record
– Read overhead after this patch: 32 μs/record
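
The caching idea can be illustrated with a small Python sketch. This is not the code from the patch: `Schema` is a hypothetical stand-in for a schema with a costly deep comparison, and `SchemaEqualityCache` is an assumed name. The point is only that when the same pair of schema objects is checked for every record, the expensive comparison needs to run once.

```python
class Schema:
    """Hypothetical stand-in for a record schema with a costly deep comparison."""
    deep_comparisons = 0  # counts how often the expensive path runs

    def __init__(self, fields):
        self.fields = tuple(fields)

    def deep_equals(self, other):
        Schema.deep_comparisons += 1
        return self.fields == other.fields


class SchemaEqualityCache:
    """Remembers the result of comparing a given pair of schema objects."""

    def __init__(self):
        self._results = {}

    def equals(self, a, b):
        if a is b:
            return True
        key = (id(a), id(b))
        if key not in self._results:
            # Store both schemas with the result so their ids cannot be
            # reused by other objects while the entry is cached.
            self._results[key] = (a, b, a.deep_equals(b))
        return self._results[key][2]


writer = Schema([("member_id", "long"), ("page", "string")])
reader = Schema([("member_id", "long"), ("page", "string")])
cache = SchemaEqualityCache()

# Simulate checking the same schema pair once per record.
all_equal = all(cache.equals(writer, reader) for _ in range(1000))
```

After the loop, `Schema.deep_comparisons` shows how many times the expensive comparison actually ran across the 1000 simulated records.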

Improvement #2
 Reduce extra data transformations
 Solution is to provide custom object inspectors
 Results
– Current read overhead: 52 μs/record
– Read overhead after this patch: 30 μs/record
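
To show what "extra data transformations" can mean, here is a rough Python sketch, not Hive's ObjectInspector API. The names `TupleRecordInspector` and `convert_record` are hypothetical. It contrasts copying every record into a new structure with reading the needed field directly from the record's native form.

```python
class TupleRecordInspector:
    """Hypothetical inspector: reads fields from a native tuple record in place."""

    def __init__(self, field_names):
        self._index = {name: i for i, name in enumerate(field_names)}

    def get_field(self, record, name):
        # No intermediate copy: index straight into the native record.
        return record[self._index[name]]


def convert_record(field_names, record):
    """The extra transformation being avoided: copy every field into a new dict."""
    return dict(zip(field_names, record))


field_names = ["member_id", "page", "timestamp"]
records = [(1, "/home", 1000), (2, "/jobs", 1001), (3, "/home", 1002)]

# Conversion-based read: builds one dict per record although one column is needed.
pages_converted = [convert_record(field_names, r)["page"] for r in records]

# Inspector-based read: same values, no per-record conversion.
inspector = TupleRecordInspector(field_names)
pages_inspected = [inspector.get_field(r, "page") for r in records]
```

Both reads return the same values; the inspector version skips allocating a converted copy of each record.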

Final Results
Read overhead per record (μs), chart values:
– Trunk: 55
– Improvement #1: 32
– Improvement #2: 30
– Combined: 11
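
As a quick check on the charted numbers, the sketch below takes the four values (55, 32, 30, 11) as μs/record for Trunk, Improvement #1, Improvement #2 and Combined, and computes the speedup and the reduction relative to Trunk. It uses the chart's trunk value of 55, which differs from the 52 μs/record quoted on the earlier slides; the `summarize` helper is an illustrative name.

```python
# Read overhead per record in microseconds, as charted on the slide.
overhead_us = {
    "Trunk": 55,
    "Improvement #1": 32,
    "Improvement #2": 30,
    "Combined": 11,
}

baseline = overhead_us["Trunk"]


def summarize(name):
    """Return (speedup factor, percent reduction) relative to trunk."""
    value = overhead_us[name]
    speedup = baseline / value
    reduction_pct = 100 * (baseline - value) / baseline
    return round(speedup, 2), round(reduction_pct, 1)


summary = {name: summarize(name) for name in overhead_us}
for name, (speedup, reduction_pct) in summary.items():
    print(f"{name}: {speedup}x faster, {reduction_pct}% less overhead")
```

With these inputs the combined patches come out at a 5x speedup over trunk, an 80% reduction in per-record read overhead.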

Hive User Base at LinkedIn
Out of all our Hadoop users:
– 56% never used Hive
– 44% use Hive
– 27% primarily use Hive
32% of Hive jobs were from ad-hoc queries

Who uses Hive and who doesn’t
Chart: Hive adoption among Hadoop users by job title (Data Scientists, Engineers, Product Managers, Customer Support Specialists, Analysts), on a 0% to 100% scale

Top concerns about Hive
Not friendly for long/complex workflows
Performance, especially for ad-hoc queries
Steep learning curve for tuning
Data/UDFs unavailability

