Pig is a platform for analyzing large datasets. Its language, Pig Latin, is a simple data flow language for expressing data processing tasks. It has a nested data model of fields, tuples, bags, and maps and supports common operators such as FILTER, FOREACH, JOIN, GROUP, and ORDER. User-defined functions can extend its built-in functionality. Pig compiles queries into one or more MapReduce jobs as needed to perform the work in parallel across a cluster.
3. Normal Operators
Arithmetic Operators
    X = FOREACH A GENERATE f1, f2, f1 % f2;
Boolean Operators
    X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
Cast Operators
    X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
Comparison Operators
    X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
Flatten Operator
    Tuple: removes a level of nesting
    Bag: removes a level of nesting; may cause a cross product
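The Flatten note above is easier to see with a small model. The following is a Python sketch of the semantics only, not Pig code; the helper names flatten_tuple and flatten_bags are hypothetical, and rows are modelled as Python tuples with bags as lists of tuples.

```python
from itertools import product

def flatten_tuple(row, idx):
    """Model FLATTEN on a tuple field: splice its items into the enclosing row."""
    return row[:idx] + tuple(row[idx]) + row[idx + 1:]

def flatten_bags(row, bag_indices):
    """Model FLATTEN on one or more bag fields of the same row.

    One output row is produced per combination of tuples taken from the
    flattened bags, which is where the cross product comes from.
    """
    bags = [row[i] for i in bag_indices]
    out = []
    for combo in product(*bags):
        picks = dict(zip(bag_indices, combo))
        new_row = []
        for i, field in enumerate(row):
            if i in picks:
                new_row.extend(picks[i])   # un-nest the chosen tuple
            else:
                new_row.append(field)      # keep other fields as they are
        out.append(tuple(new_row))
    return out

nested = ("a", (1, 2), "z")
two_bags = (1, [("a",), ("b",)], [("x",), ("y",)])
print(flatten_tuple(nested, 1))
print(flatten_bags(two_bags, [1, 2]))
```

Flattening two bags of two tuples each yields four rows in this model, and an empty bag yields no rows at all.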
5. Relational Operators
LOAD: read data as a bag of tuples
    A = LOAD 'data' [USING function] [AS schema];
STORE: write a relation to storage
    STORE alias INTO 'directory' [USING function];
FOREACH: for each tuple in the bag, produce a new tuple
    A = FOREACH queries GENERATE uid, expandQuery(query);
FILTER: filter a bag to produce a subset of it
    A = FILTER queries BY uid neq 'bot' OR notBot(uid);
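6. Relational Operators
COGROUP/GROUP: group one or more relations (at most 127)
    alias = GROUP alias BY expression [, alias BY expression ...];
Schema of the result:
    {group: int, A: {name: chararray, age: int, gpa: float}}
Example output tuple:
    (18,{(John,18,4.0F),(Joe,18,3.8F)})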
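To make the shape of the GROUP/COGROUP result concrete, here is a Python sketch of the semantics (not Pig code). The function names are hypothetical, the Mary row and the relation B are invented for illustration, and bags are modelled as lists even though Pig bags are unordered.

```python
def group_by(relation, key_index):
    """Model GROUP: one (group, bag) tuple per distinct key."""
    groups = {}
    for t in relation:
        groups.setdefault(t[key_index], []).append(t)
    return list(groups.items())

def cogroup(a, a_key, b, b_key):
    """Model COGROUP of two relations: (group, bag_from_a, bag_from_b)."""
    keys = {}
    for t in a:
        keys.setdefault(t[a_key], ([], []))[0].append(t)
    for t in b:
        keys.setdefault(t[b_key], ([], []))[1].append(t)
    return [(k, bag_a, bag_b) for k, (bag_a, bag_b) in keys.items()]

# John and Joe come from the slide; Mary and relation B are made up.
A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Joe", 18, 3.8)]
B = [(18, "freshman")]

print(group_by(A, 1))
print(cogroup(A, 1, B, 0))
```

A key that appears in only one input still produces a row in the COGROUP model, with an empty bag for the other input.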
7. Relational Operators
JOIN (inner/outer)
Replicated Joins
    One or more relations are small enough to fit into main memory.
Skewed Joins
    Computes a histogram of the key space and uses this data to allocate reducers for a given key.
Merge Joins
    Inputs are already sorted; the join is performed in the map phase.
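The idea behind a replicated join can be sketched in a few lines of Python. This is an illustration of the general technique (build an in-memory hash table from the small relation, then stream the large one past it), not Pig's actual implementation; replicated_join and the sample data are hypothetical, and only the inner-join case is shown.

```python
def replicated_join(large, small, large_key, small_key):
    """Inner join that keeps the small relation entirely in memory."""
    # Build phase: index the small relation by its join key.
    table = {}
    for s in small:
        table.setdefault(s[small_key], []).append(s)
    # Probe phase: stream the large relation; no shuffle or sort is needed.
    for row in large:
        for match in table.get(row[large_key], []):
            yield row + match

clicks = [("u1", "/home"), ("u2", "/cart"), ("u1", "/pay"), ("u3", "/home")]
users = [("u1", "alice"), ("u2", "bob")]
print(list(replicated_join(clicks, users, 0, 0)))
```

Because each mapper can hold its own copy of the small table, a join of this kind can finish in the map phase, which is why the small side has to fit in main memory.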
9. Relational Operators
ORDER
    alias = ORDER alias BY field [ASC|DESC];
    Unstable: tuples with equal keys may come back in any order.
SPLIT
    SPLIT alias INTO alias IF expression, alias IF expression;
CROSS: cross product
    X = CROSS A, B;
DISTINCT: removes duplicate tuples in a relation
    X = DISTINCT A;
LIMIT
    X = LIMIT A 3;
SAMPLE
    X = SAMPLE alias size;
IMPORT: import another .pig file
DEFINE: define a Pig macro
10. Built-in Eval Functions
AVG/MAX/MIN/SUM
    Operate on a single column of a bag; group the data first
COUNT/COUNT_STAR
    Number of elements in a bag; COUNT_STAR also counts nulls
CONCAT
DIFF
IsEmpty
SIZE
TOKENIZE
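The difference between COUNT and COUNT_STAR can be modelled as follows. This is a Python sketch of the documented behaviour (COUNT skips a tuple whose first field is null, COUNT_STAR does not), with None standing in for null; the function names and the sample bag are hypothetical.

```python
def count(bag):
    """Model COUNT: ignore tuples whose first field is null."""
    return sum(1 for t in bag if len(t) > 0 and t[0] is not None)

def count_star(bag):
    """Model COUNT_STAR: count every tuple, nulls included."""
    return len(bag)

bag = [(1,), (None,), (3,)]
print(count(bag))       # 2
print(count_star(bag))  # 3
```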
11. Other Built-in Functions
Load/Store Functions
Math Functions
String Functions
12. Map-Reduce Plan Compilation
Compile each GROUP into a distinct Map-Reduce job
Push commands between LOAD and GROUP to the map side
Commands between subsequent GROUPs Gi and Gi+1 are pushed into the reduce side of Gi
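The three compilation rules above can be applied mechanically to a linear script. The following Python sketch does only that: it treats a script as a flat list of operator names and assigns each one to the map or reduce side of a job. The function compile_plan and the job dictionary layout are hypothetical, and real Pig plans are graphs with further optimizations that this ignores.

```python
def compile_plan(ops):
    """Split a linear list of operators into Map-Reduce jobs.

    Each GROUP/COGROUP starts a new job. Operators before the first GROUP
    run on the map side of that job; operators after a GROUP run on its
    reduce side until the next GROUP appears.
    """
    jobs = []
    pending_map = []
    for op in ops:
        if op.split()[0] in ("GROUP", "COGROUP"):
            jobs.append({"map": pending_map, "shuffle": op, "reduce": []})
            pending_map = []
        elif jobs:
            jobs[-1]["reduce"].append(op)   # between Gi and Gi+1: reduce side of Gi
        else:
            pending_map.append(op)          # between LOAD and first GROUP: map side
    if not jobs and pending_map:
        # No GROUP at all: model this as a single map-only job.
        jobs.append({"map": pending_map, "shuffle": None, "reduce": []})
    return jobs

script = ["LOAD", "FILTER", "GROUP by uid", "FOREACH", "GROUP by total", "FOREACH"]
for job in compile_plan(script):
    print(job)
```

For the sample script this gives two jobs: LOAD and FILTER on the map side of the first, and each FOREACH on the reduce side of the GROUP that precedes it.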
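14. User Defined Function
Simple Eval Function
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // .......
        }
    }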
15. User Defined Function
Aggregate Functions
Algebraic Interface
    For functions that can be computed incrementally in a distributed fashion.
Accumulator Interface
    Designed to decrease memory usage.
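What "computed incrementally in a distributed fashion" means for an algebraic function can be shown with an average. The Python sketch below mirrors the three stages of Pig's Algebraic interface (initial, intermediate, final) using plain functions; it is a model of the idea rather than a Pig UDF, and the function names and sample chunks are hypothetical.

```python
def initial(value):
    """Map side: turn one input value into a partial (sum, count)."""
    return (value, 1)

def intermed(partials):
    """Combiner: merge partial results into one partial (sum, count)."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def final(partials):
    """Reduce side: merge the remaining partials and produce the average."""
    total, count = intermed(partials)
    return total / count if count else None

# Three chunks stand in for data processed on three different mappers.
chunks = [[1, 2], [3], [4, 5, 6]]
combined = [intermed([initial(v) for v in chunk]) for chunk in chunks]
print(final(combined))  # 3.5
```

Only small (sum, count) pairs travel between stages, never the full bag, which is the property that lets such functions use the combiner.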