Hive vs Pig
Anatoliy Nikulin
Comparing Hive and Pig, two ETL systems for working with big data.
Hive vs Pig
1.
What do I need so many different tools for?
Hive vs Pig
"Better to lose a day, but then fly there in five minutes" (c)
Anatoliy Nikulin
2.
Hive
SQL-like language: HQL
Interactive console
Built-in aggregation functions
UDF support (Java, Python)
Data is represented as tables
3.
Hive - pros
Good old SQL. It is well suited to describing selections, and simply put, everyone knows it.
MR jobs are generated under the hood. This removes much of the overhead of the boilerplate around MR: describing data models, input and output formats, and chains of MR jobs.
Interactivity. Good for analyzing data from different angles.
Speed of development. No dependencies, builds, or unit tests.
4.
Hive - cons
Not everything can be expressed in HQL.
Working with dates is hard.
HQL is easy to keep in your head for simple selections, but not for complex ones.
5.
Hive - a simple example
6.
Hive - a complex example
7.
So, how do you like it?
(somewhat unwieldy) It lacks a certain orderliness. It would be nice to sort everything onto shelves...
8.
Pig
Pig: a procedural approach
Interactive console
Built-in aggregation functions
UDF support (Java, Jython)
Data is represented as various structures
9.
Pig - pros
Procedural approach. Orderliness! The language lets you split the logic into blocks, and every step can be described in detail with comments.
MR jobs are generated under the hood. This removes much of the overhead of the boilerplate around MR: describing data models, input and output formats, and chains of MR jobs.
Interactivity. Good for analyzing data from different angles.
Speed of development. No dependencies, builds, or unit tests.
10.
Pig - cons
Not everything can be expressed in Pig Latin.
Pig Latin, together with its data structures, is more complex than HiveQL.
UDFs in Python are run with Jython, which can restrict the use of some libraries.
11.
Pig - data structures
Tuple - an ordered set of fields. A structure whose fields can be accessed by index and/or by name.
Bag - a collection (set) of Tuples.
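The two structures above can be sketched in Pig Latin as follows. The path, the relation names, and the field names are placeholders invented for illustration; they do not come from the slides.

```pig
-- Hypothetical input: a tab-separated file with two columns.
users = LOAD '/tmp/example/users' USING PigStorage('\t')
        AS (name:chararray, age:int);

-- Every record of 'users' is a tuple. Its fields can be read by name...
by_name  = FOREACH users GENERATE name, age;
-- ...or by position, counting from $0.
by_index = FOREACH users GENERATE $0, $1;

-- GROUP produces one tuple per key; its second field is a bag
-- holding all original tuples that share that key.
by_age = GROUP users BY age;

-- Print the schema to see the tuple-with-a-bag structure.
DESCRIBE by_age;
```

Running the DESCRIBE line in the interactive console (grunt) is a quick way to check what structure a relation has before writing the next step.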
12.
Pig - functions
LOAD STORE GENERATE JOIN GROUP FILTER UNION DISTINCT ORDER
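A minimal sketch of the operators from this list that the later slides do not demonstrate. The paths, relation names, the second `dsps` file, and the threshold of 20 are all assumptions made up for the example.

```pig
-- Hypothetical inputs; only the operators themselves come from the slide.
bids = LOAD '/tmp/example/bids' USING PigStorage('\t')
       AS (time:chararray, bid_id:chararray, user_id:chararray, dsp_id:chararray, bid:int);
more_bids = LOAD '/tmp/example/more_bids' USING PigStorage('\t')
       AS (time:chararray, bid_id:chararray, user_id:chararray, dsp_id:chararray, bid:int);
dsps = LOAD '/tmp/example/dsps' USING PigStorage('\t')
       AS (dsp_id:chararray, dsp_name:chararray);

-- FILTER: keep only the tuples that satisfy a condition.
big_bids = FILTER bids BY bid > 20;

-- DISTINCT: remove duplicate tuples.
dsp_ids   = FOREACH bids GENERATE dsp_id;
uniq_dsps = DISTINCT dsp_ids;

-- ORDER: sort a relation by one or more fields.
sorted_bids = ORDER big_bids BY bid DESC;

-- JOIN: inner join of two relations on a key.
bids_with_names = JOIN big_bids BY dsp_id, dsps BY dsp_id;

-- UNION: concatenate two relations with the same schema.
all_bids = UNION bids, more_bids;

-- STORE: write a relation out; this is what triggers execution.
STORE sorted_bids INTO '/tmp/example/out/sorted_bids' USING PigStorage('\t');
```

Only the relations that feed a STORE (or a DUMP in the console) are actually computed, so the unused aliases above cost nothing at run time.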
13.
Pig - loading data (LOAD)
fs -rm -f -r -skipTrash /analytical_engine/pig/out
raw_data = LOAD '/analytical_engine/data/example/' USING PigStorage('\t') AS (time, bid_id, user_id, dsp_id, bid:int);

raw_data -> tuples with named fields
-------------------------------------------------------------------------------------------
time    bid_id    user_id    dsp_id    bid
-------------------------------------------------------------------------------------------
(2014.02.14 14:08:27.711, 56949, User-id-1, DSP-2, 12)
(2014.02.14 14:08:28.712, 61336, 45221696259999, DSP-1, 56)
(2014.02.14 14:08:29.713, 74685, 45221699381039, DSP-2, 89)
(2014.02.14 14:08:30.714, 56949, 45221695781716, DSP-1, 21)
(2014.02.14 14:08:25.715, 27617, 45221682863705, DSP-3, 22)
14.
Pig - iterative data processing (FOREACH - GENERATE)
-- Normalize the data
norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;

norm_data -> tuples with named fields and a truncated date
---------------------------------------
date    dsp_id    bid
---------------------------------------
(2014.02.14, DSP-2, 12)
(2014.02.14, DSP-1, 56)
(2014.02.14, DSP-2, 89)
(2014.02.14, DSP-1, 21)
15.
Pig - grouping data (GROUP)
-- Group by dsp_id and date
group_norm_data = GROUP norm_data BY (dsp_id, date);

group_norm_data -> (the group as a key) : [ (norm_data), (norm_data) ]
-------------------------------------------------------------------------------------------------------------
group    array [norm_data, ...]
-------------------------------------------------------------------------------------------------------------
( (DSP-1, 2014.02.14), {(2014.02.14, DSP-1, 56), (2014.02.14, DSP-1, 21)} )
( (DSP-1, 2014.02.17), {(2014.02.17, DSP-1, 34), (2014.02.17, DSP-1, 24)} )
( (DSP-2, 2014.02.14), {(2014.02.14, DSP-2, 89), (2014.02.14, DSP-2, 12)} )

The second field is the list of aggregates, named with the norm_data prefix.
16.
Pig - unrolling aggregates (FLATTEN)
-- Unroll the aggregates into a flat structure
ft_group_norm_data = FOREACH group_norm_data GENERATE FLATTEN(group), FLATTEN(norm_data);

ft_group_norm_data -> tuples with named fields
----------------------------------------------------------------------
dsp_id, date    date    dsp_id    bid
----------------------------------------------------------------------
(DSP-1, 2014.02.14, 2014.02.14, DSP-1, 56)
(DSP-1, 2014.02.14, 2014.02.14, DSP-1, 21)
(DSP-1, 2014.02.15, 2014.02.15, DSP-1, 15)
(DSP-1, 2014.02.15, 2014.02.15, DSP-1, 31)
17.
Pig - aggregation functions (COUNT)
-- Count the number of aggregates in each group
count_agg = FOREACH group_norm_data GENERATE group, COUNT(norm_data);

count_agg -> group : $1 /an unnamed field, since we did not use AS/
------------------------------------------------------
group    $1 (count)
------------------------------------------------------
( (DSP-1, 2014.02.14), 2 )
( (DSP-1, 2014.02.15), 3 )
( (DSP-1, 2014.02.16), 2 )
18.
Pig - aggregation functions (SUM)
-- Compute the sum of the bids made by each broker (dsp_id)
sum_bids_dsp = FOREACH group_norm_data GENERATE group, SUM(norm_data.bid) AS bids_sum;

sum_bids_dsp -> group : bids_sum
------------------------------------------------------
group    bids_sum
------------------------------------------------------
( (DSP-1, 2014.02.16), 82)
( (DSP-1, 2014.02.17), 58)
( (DSP-2, 2014.02.14), 101)
( (DSP-2, 2014.02.16), 58)
19.
Pig - GROUP ALL
-- Compute the grand total and the number of groups.
-- To do that, merge everything together.
group_all = GROUP sum_bids_dsp ALL;

( all, { ((DSP-1,2014.02.14),77), ((DSP-1,2014.02.15),67), ((DSP-1,2014.02.16),82),
         ((DSP-1,2014.02.17),58), ((DSP-2,2014.02.14),101), ((DSP-2,2014.02.16),58),
         ((DSP-2,2014.02.17),123), ((DSP-3,2014.02.14),22), ((DSP-3,2014.02.15),109),
         ((DSP-3,2014.02.16),136), ((DSP-3,2014.02.17),81) } )

-- and compute the count and the sum
summary = FOREACH group_all GENERATE COUNT(sum_bids_dsp), SUM(sum_bids_dsp.bids_sum);
------------------------------------------------------
count    sum
------------------------------------------------------
(11, 914)
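For reference, the steps from the LOAD, FOREACH, GROUP, SUM and GROUP ALL slides can be assembled into one script. This is only a sketch. The explicit field types, the AS names on the aggregates (`bid_count`, `group_count`, `total_bids`), and the final STORE into the output directory are assumptions added here, not something the slides show.

```pig
-- Clear the output directory, as on the LOAD slide.
fs -rm -f -r -skipTrash /analytical_engine/pig/out

-- Assumption: explicit chararray types so that SUBSTRING gets a string.
raw_data = LOAD '/analytical_engine/data/example/' USING PigStorage('\t')
           AS (time:chararray, bid_id:chararray, user_id:chararray,
               dsp_id:chararray, bid:int);

-- Keep only the date part of the timestamp.
norm_data = FOREACH raw_data GENERATE SUBSTRING(time, 0, 10) AS date, dsp_id, bid;

-- One group per (dsp_id, date) pair.
group_norm_data = GROUP norm_data BY (dsp_id, date);

-- Name every output field so later steps do not depend on $0, $1, ...
sum_bids_dsp = FOREACH group_norm_data GENERATE
                   FLATTEN(group) AS (dsp_id, date),
                   COUNT(norm_data) AS bid_count,
                   SUM(norm_data.bid) AS bids_sum;

-- Collapse all groups into a single bag to get the overall totals.
group_all = GROUP sum_bids_dsp ALL;
summary = FOREACH group_all GENERATE
              COUNT(sum_bids_dsp) AS group_count,
              SUM(sum_bids_dsp.bids_sum) AS total_bids;

-- Assumption: results are written to the directory cleared at the top.
STORE summary INTO '/analytical_engine/pig/out' USING PigStorage('\t');
```

Naming the aggregate fields with AS avoids the positional references that the COUNT slide points out, at the cost of slightly longer GENERATE clauses.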
20.
Summary
Hive is good for small and simple selections. HQL resembles SQL, so you can start working with Hive very quickly.
Pig requires learning the language and its data structures. In return, once you have figured it out, you get a more powerful tool in which complex, multi-stage selections are easier to implement. You get simple, orderly code with accessible and relevant comments.
21.
Questions?