This is the Apache Spark session, with examples. It gives a brief overview of Apache Spark, a fast and general engine for large-scale data processing. By the end of this presentation you should have a clear understanding of Apache Spark.
To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-spark
2. Sandeep Giri | Hadoop
Apache Spark
A fast and general engine for large-scale data processing.
Really fast: 100x faster than Hadoop MapReduce in memory, 10x faster on disk.
Builds on paradigms similar to Hadoop's.
Integrated with Hadoop.
4. Sandeep Giri | Hadoop
Log in as root:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.4.tgz
tar zxvf spark-1.1.0-bin-hadoop2.4.tgz && rm spark-1.1.0-bin-hadoop2.4.tgz;
mv spark-1.1.0-bin-hadoop2.4 /usr/lib/
cd /usr/lib;
ln -s spark-1.1.0-bin-hadoop2.4/ spark
Log in as student and start the pyspark shell:
/usr/lib/spark/bin/pyspark
INSTALLING ON YARN
Already Installed on hadoop1.knowbigdata.com
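Once the shell starts, a quick sanity check can confirm the install works. This is a minimal sketch; sc is the SparkContext that the pyspark shell creates automatically, and the variable name is illustrative:

# Distribute a small local list across the cluster and run an action on it.
data = sc.parallelize(range(100))
print(data.count())  # expect 100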
5. Sandeep GiriHadoop
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster:
lines = sc.textFile('hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt')
RDDs can be persisted in memory.
RDDs recover automatically from node failures.
An RDD can hold any data type, and there is a special dataset type for key-value pairs.
Supports two types of operations: transformations and actions.
Each element of the RDD across the cluster is run through the function passed to map, as sketched below.
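A minimal pyspark sketch tying these concepts together; the lines RDD and HDFS path come from the slide above, while the lengths variable and the lambda are illustrative:

# Transformation: describe a new RDD of line lengths (lazy; nothing runs yet).
lengths = lines.map(lambda line: len(line))
# Persist the RDD in memory so later actions can reuse it.
lengths.cache()
# Action: triggers the computation and returns a number to the driver.
print(lengths.count())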
7. Sandeep GiriHadoop
SPARK - TRANSFORMATIONS
map(func): Return a new distributed dataset formed by passing each element of the source through the function func. Analogous to FOREACH in Pig.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs.
See More: sample, union, intersection, distinct, groupByKey, reduceByKey, sortByKey, join
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
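A minimal pyspark sketch of these transformations, continuing from the lines RDD above (the variable names are illustrative):

words = lines.flatMap(lambda line: line.split())  # each line maps to 0 or more words
long_words = words.filter(lambda w: len(w) > 3)   # keep only words longer than 3 characters
pairs = long_words.map(lambda w: (w, 1))          # build (K, V) pairs
grouped = pairs.groupByKey()                      # (word, iterable of 1s); still lazy until an action runs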
8. Sandeep Giri | Hadoop
SPARK - ACTIONS
// lineLengths is assumed to be a JavaRDD<Integer>, e.g. built with a map over the lines.
// reduce is an action: Spark runs the job and returns the sum to the driver.
int totalLength = lineLengths.reduce(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
    });
Actions return a value to the driver program.
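For comparison, the same action in the pyspark shell is a one-liner; this is a sketch assuming the lengths RDD built in the earlier example:

# reduce sums all elements in parallel and returns the total to the driver.
total_length = lengths.reduce(lambda a, b: a + b)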
9. Sandeep Giri | Hadoop
SPARK - ACTIONS
reduce(func): Aggregate the elements of the dataset using the function func, which takes two arguments and returns one. func must be commutative and associative so it can be computed in parallel.
count(): Return the number of elements in the dataset.
collect(): Return all elements of the dataset as an array at the driver. Use only for small outputs.
take(n): Return an array with the first n elements of the dataset. Not parallel.
See More: first(), takeSample(), takeOrdered(), saveAsTextFile(path)
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
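A minimal pyspark sketch of these actions, reusing the words RDD from the transformations sketch (the variable names are illustrative, not from the slides):

n = words.count()                                      # number of elements in the RDD
sample = words.take(5)                                 # first 5 elements; no full parallel scan
long_ones = words.filter(lambda w: len(w) > 15).collect()  # collect brings everything to the driver; keep it small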