Introduction to Pig
Xiafei.qiu@PCA

Nested Data Model
Field, Tuple, Bag, Map

Normal Operators
Arithmetic operators
    X = FOREACH A GENERATE f1, f2, f1 % f2;
Boolean operators
    X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
Cast operators
    X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
Comparison operators
    X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
Flatten operator
    Tuple: removes a level of nesting
    Bag: removes a level of nesting; may cause a cross product
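
The note that flattening a bag "may cause a cross product" can be illustrated with a small model. The sketch below is plain Java and does not use Pig's classes: tuples are modelled as lists of fields and bags as lists of tuples, and the names FlattenSketch, flattenBag, and the sample values are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of FLATTEN on a bag field; this is not Pig's implementation.
// A tuple is modelled as a list of fields, a bag as a list of tuples.
class FlattenSketch {
    static List<Object> tuple(Object... fields) {
        return new ArrayList<>(Arrays.asList(fields));
    }

    // Emits one output tuple per tuple in the bag at bagIndex, splicing the
    // inner tuple's fields in place of the bag (one level of nesting removed).
    static List<List<Object>> flattenBag(List<Object> input, int bagIndex) {
        @SuppressWarnings("unchecked")
        List<List<Object>> bag = (List<List<Object>>) input.get(bagIndex);
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> inner : bag) {
            List<Object> row = new ArrayList<>(input.subList(0, bagIndex));
            row.addAll(inner);
            row.addAll(input.subList(bagIndex + 1, input.size()));
            out.add(row);
        }
        return out;
    }

    // Flattening two bags of the same tuple yields their cross product.
    static List<List<Object>> demo() {
        List<List<Object>> numbers = Arrays.asList(tuple(1), tuple(2));
        List<List<Object>> letters = Arrays.asList(tuple("x"), tuple("y"));
        List<Object> input = tuple("a", numbers, letters); // hypothetical sample tuple
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> row : flattenBag(input, 1)) {
            out.addAll(flattenBag(row, 2)); // the second bag now sits at index 2
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With two bags of two tuples each, the single input tuple expands to four output tuples, which is why flattening several bags at once can inflate a relation.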

Relational Operators
LOAD: loads data as a bag of tuples
    A = LOAD 'data' [USING function] [AS schema];
STORE: writes a relation to storage (STORE does not produce an alias)
    STORE alias INTO 'directory' [USING function];
FOREACH: for each tuple in the bag, produce a new tuple
    A = FOREACH queries GENERATE uid, expandQuery(query);
FILTER: filters a bag to produce a subset of it
    A = FILTER queries BY uid neq 'bot' OR notBot(uid);

Relational Operators
COGROUP/GROUP: groups one or more relations (at most 127)
    alias = GROUP  by ,  by
Example of a grouped relation (schema, then one tuple):
    {group: int, A: {name: chararray, age: int, gpa: float}}
    (18,{(John,18,4.0F),(Joe,18,3.8F)})
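
The grouped tuple shown on this slide can be reproduced with a short model of GROUP BY. This is a minimal sketch in plain Java, not Pig code: the class GroupSketch, the Student type, and the third row (Mary) are hypothetical and were added only so that more than one group appears.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of "GROUP A BY age"; not Pig's implementation.
class GroupSketch {
    static class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }

        @Override
        public String toString() {
            return "(" + name + "," + age + "," + gpa + ")";
        }
    }

    // Each distinct key becomes one output tuple: (group, bag of matching tuples).
    static Map<Integer, List<Student>> groupByAge(List<Student> relation) {
        Map<Integer, List<Student>> groups = new TreeMap<>();
        for (Student s : relation) {
            groups.computeIfAbsent(s.age, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    static Map<Integer, List<Student>> demo() {
        List<Student> a = Arrays.asList(
            new Student("John", 18, 4.0f),
            new Student("Mary", 19, 3.9f), // hypothetical extra row
            new Student("Joe", 18, 3.8f));
        return groupByAge(a);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The entry for key 18 holds the John and Joe tuples, mirroring the (18,{...}) tuple on the slide.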

Relational Operators
JOIN (inner/outer)
Replicated joins: one or more relations are small enough to fit into main memory.
Skewed joins: compute a histogram of the key space and use it to allocate reducers for a given key.
Merge joins: the inputs are already sorted, so the join is performed in the map phase.
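
The idea behind a replicated join can be sketched as a hash join in which the small relation is held entirely in memory while the large one is streamed past it. The code below is plain Java under that assumption, not Pig's implementation; ReplicatedJoinSketch and the click/user sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a replicated (in-memory hash) inner join.
// Rows are String arrays whose first element is the join key.
class ReplicatedJoinSketch {
    static List<String> join(List<String[]> large, List<String[]> small) {
        // Build phase: index the small relation by key, entirely in memory.
        Map<String, List<String[]>> inMemory = new HashMap<>();
        for (String[] row : small) {
            inMemory.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: stream the large relation; no shuffle or reduce is needed.
        List<String> out = new ArrayList<>();
        for (String[] l : large) {
            List<String[]> matches = inMemory.get(l[0]);
            if (matches == null) {
                continue; // inner join: unmatched rows are dropped
            }
            for (String[] s : matches) {
                out.add("(" + l[0] + "," + l[1] + "," + s[1] + ")");
            }
        }
        return out;
    }

    static List<String> demo() {
        // Hypothetical sample data: clicks is "large", users is "small".
        List<String[]> clicks = Arrays.asList(
            new String[] {"u1", "a.html"},
            new String[] {"u2", "b.html"},
            new String[] {"u1", "c.html"},
            new String[] {"u3", "d.html"});
        List<String[]> users = Arrays.asList(
            new String[] {"u1", "Ann"},
            new String[] {"u2", "Bob"});
        return join(clicks, users);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because every probe is a local lookup, this style of join only works when the indexed relation really does fit in memory.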

Relational Operators
ORDER: not stable (tuples with equal keys may come back in any order)
    alias BY field DESC/ASC
SPLIT
    alias INTO alias IF , alias IF
CROSS: cross product
    X = CROSS A, B;
DISTINCT: removes duplicate tuples in a relation
    X = DISTINCT A;
LIMIT
    LIMIT A 3;
SAMPLE
    SAMPLE alias size;
IMPORT: imports another .pig file
DEFINE: defines a Pig macro

Built-In Eval Functions
AVG/MAX/MIN/SUM: operate on a single column of a bag; group the relation first
COUNT/COUNT_STAR: number of elements in a bag; COUNT_STAR also counts nulls
CONCAT
DIFF
IsEmpty
SIZE
TOKENIZE
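
The difference between COUNT and COUNT_STAR comes down to null handling: COUNT skips a tuple whose first field is null, while COUNT_STAR counts every tuple. The sketch below models that rule in plain Java; CountSketch and its sample bag are hypothetical, and the code is not taken from Pig.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of COUNT vs COUNT_STAR null handling; not Pig's implementation.
// A bag is a list of tuples; a tuple is a list of fields.
class CountSketch {
    // COUNT: a tuple is skipped when its first field is null.
    static long count(List<List<Object>> bag) {
        long n = 0;
        for (List<Object> t : bag) {
            if (t.get(0) != null) {
                n++;
            }
        }
        return n;
    }

    // COUNT_STAR: every tuple is counted, nulls included.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    static List<List<Object>> demoBag() {
        // Hypothetical sample bag with one null in the first field.
        return Arrays.asList(
            Arrays.<Object>asList("alice", 1),
            Arrays.<Object>asList(null, 2),
            Arrays.<Object>asList("bob", 3));
    }

    public static void main(String[] args) {
        System.out.println("COUNT=" + count(demoBag())
            + " COUNT_STAR=" + countStar(demoBag()));
    }
}
```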

Other Built-In Functions
Load/Store functions
Math functions
String functions

Map-Reduce Plan Compilation
Compile each GROUP into a distinct Map-Reduce job
Push the commands between LOAD and GROUP to the map side
Push the commands between subsequent GROUPs Gi and Gi+1 to the reduce side of Gi

Map-Reduce Plan Compilation
ORDER is compiled into two Map-Reduce jobs.
MR1: sample the key space
MR2: sort
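
One way to read the two-job plan: the first job samples keys to choose range boundaries, and the second routes each key to its range partition and sorts within it, so that the partitions concatenated in order are globally sorted. The sketch below models that reading in plain Java. It is an assumption about the general technique rather than Pig's actual code, and the deterministic every-nth sampling, the class OrderSketch, and the sample keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model of a sample-then-sort plan; not Pig's implementation.
class OrderSketch {
    // "MR1": sample every step-th key and pick parts-1 cut points from the sample.
    static int[] boundaries(List<Integer> keys, int parts, int step) {
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += step) {
            sample.add(keys.get(i));
        }
        Collections.sort(sample);
        int[] cuts = new int[parts - 1];
        for (int p = 1; p < parts; p++) {
            cuts[p - 1] = sample.get(p * sample.size() / parts);
        }
        return cuts;
    }

    // "MR2": route each key to its range partition, sort each partition
    // independently, then concatenate the partitions in order.
    static List<Integer> sort(List<Integer> keys, int[] cuts) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p <= cuts.length; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int k : keys) {
            int p = 0;
            while (p < cuts.length && k >= cuts[p]) {
                p++;
            }
            partitions.get(p).add(k);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : partitions) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    static List<Integer> demo() {
        List<Integer> keys = Arrays.asList(42, 7, 19, 88, 3, 56, 71, 25, 64, 10);
        return sort(keys, boundaries(keys, 3, 2));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Sampling first lets the sort job balance its partitions even when the key distribution is uneven, because the boundaries follow the observed keys rather than a fixed range.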

User Defined Function
Simple Eval Function
    public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // .......
        }
    }

User Defined Function
Aggregate Functions
Algebraic interface: for functions that can be computed incrementally in a distributed fashion.
Accumulator interface: designed to decrease memory usage.

Accumulator Interface
    public interface Accumulator<T> {
        public void accumulate(Tuple b) throws IOException;
        public T getValue();
        public void cleanup();
    }
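
The accumulate/getValue/cleanup cycle can be sketched without the Pig library. In the plain-Java model below, BatchAccumulator is a hypothetical stand-in that mirrors the three methods of the interface above but takes a list of longs instead of a Pig Tuple; SumAccumulator and the sample batches are likewise assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the three methods of the slide's interface;
// it takes a plain list instead of a Pig Tuple so the sketch has no dependencies.
interface BatchAccumulator<T> {
    void accumulate(List<Long> batch);
    T getValue();
    void cleanup();
}

// SUM implemented incrementally: only the running total is kept in memory,
// never the whole bag.
class SumAccumulator implements BatchAccumulator<Long> {
    private long sum = 0;

    @Override
    public void accumulate(List<Long> batch) {
        for (long v : batch) {
            sum += v;
        }
    }

    @Override
    public Long getValue() {
        return sum;
    }

    @Override
    public void cleanup() {
        sum = 0; // reset state before the next key is processed
    }
}

class AccumulatorSketch {
    static long demo() {
        SumAccumulator acc = new SumAccumulator();
        // The values for one key arrive in several batches (hypothetical data).
        acc.accumulate(Arrays.asList(1L, 2L, 3L));
        acc.accumulate(Arrays.asList(4L, 5L));
        long result = acc.getValue();
        acc.cleanup();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because each batch is folded into the running total and then discarded, memory use stays bounded by the batch size rather than by the size of the bag.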

Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }
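
The three names returned by an Algebraic function identify an initial, an intermediate, and a final step. A COUNT-style aggregate shows how they divide the work: the initial step turns each input tuple into a partial count, the intermediate step merges partial counts (it may run zero or more times), and the final step produces the result. The sketch below models this in plain Java without the Pig library; AlgebraicCountSketch, its method names, and the two sample splits are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of an algebraic COUNT; not Pig's implementation.
class AlgebraicCountSketch {
    // Initial: called per input tuple, emits a partial count of 1.
    static long initial(Object tuple) {
        return 1L;
    }

    // Intermediate: merges partial counts into another partial count.
    static long intermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final: merges the remaining partial counts into the result.
    static long finalStep(List<Long> partials) {
        return intermed(partials);
    }

    // Runs the initial and intermediate steps over one split of the input.
    static long partialCount(List<String> split) {
        List<Long> ones = new ArrayList<>();
        for (String tuple : split) {
            ones.add(initial(tuple));
        }
        return intermed(ones);
    }

    static long demo() {
        // Two hypothetical input splits processed independently.
        List<String> split1 = Arrays.asList("a", "b", "c");
        List<String> split2 = Arrays.asList("d", "e");
        return finalStep(Arrays.asList(partialCount(split1), partialCount(split2)));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Since partial counts can be merged in any grouping, most of the work can happen close to the data before a single final merge.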
