1
? ??? ???
2016.01.18
???
Spark SQL
2
Contents
- Spark SQL overview
  - Tungsten execution engine
  - Catalyst optimizer
  - RDD, DataFrame
  - Dataset
- Spark SQL application (in Java)
  - Linking
  - Example
- References
3
Spark SQL
- Spark library for processing structured/semi-structured data
- Uses the Tungsten execution engine & Catalyst optimizer
- Provides several interfaces
  - SQL, HiveQL queries
  - DataFrame API
    - Available in Scala, Java, Python, and R
    - Since Spark 1.3
  - Dataset API
    - Available in Scala and Java
    - Since Spark 1.6
- Supports various input sources
  - RDD & ?? ???
  - JSON datasets
  - Parquet files
  - Hive tables
? ODBC/JDBC ???? ??
4
Tungsten execution engine
? ???? bottleneck??
? I/O? network bandwidth? ??
? High bandwidth SSD & striped HDD? ??
? 10Gbps network? ??
? CPU ? memory?? bottleneck ??? ??
? ???? processing workload? ??
? Disk I/O? ????? ?? input data pruning workload
? Shuffle? ?? serialization? hashing? ?? key bottleneck
? CPU? memory? ??? ???
? ????? ??? ??? ??? ??? ? ?? System Engine? ??!
? Project Tungsten
? Spark 1.4?? DataFrame? ??
? Spark 1.6?? Dataset?? ??
Sources:
1. Project Tungsten - Databricks
2. https://issues.apache.org/jira/browse/SPARK-7075
5
Tungsten execution engine - three goals
? Memory Management and Binary Processing
? JVM object ??? garbage collection? overhead? ?
=>????? data? ??? ?? Java objects ??? binary format?? ??
? ???? ?? ???? ??? ??? ??
=> denser in-memory data format? ???? ??? ???? ??? ?
? ?? memory accounting (size of bytes) ?? ??(??? Heuristics ??)
? ?? ???? ??? domain semantics? ??? ??? data processing? ???? ?
=>binary format? in memory data? ???? data type? ???? operator? ??
(serialization/deserialization ?? data processing)
? Cache-aware Computation
? sorting and hashing for aggregations, joins, and shuffle? ??? ??? ??
=>memory hierarchy? ???? algorithm? data structure
? Code Generation
? expression evaluation, DataFrame/SQL operators, serializer? ?? ??? ??
=>modern compilers and CPUs? ??? ??? ??? ? ?? code generation
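The memory-management goal above, keeping data in a compact binary format and running operators on it directly, can be illustrated with a small plain-Java sketch. It is not Tungsten's actual row format: the fixed 8-byte layout and the names `BinaryRows`, `pack`, and `sumAges` are assumptions made for illustration. Because all records live in one buffer, the size in bytes is known exactly, which is what makes byte-level memory accounting possible.

```java
import java.nio.ByteBuffer;

// Plain-Java illustration only; Spark's real binary row format is more elaborate.
class BinaryRows {
    // Hypothetical fixed-width layout per record: [int id][int age] = 8 bytes.
    static final int ROW_SIZE = 8;
    static final int AGE_OFFSET = 4;

    // Pack (id, age) pairs into one contiguous buffer instead of one object per record.
    static ByteBuffer pack(int[] ids, int[] ages) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * ROW_SIZE);
        for (int i = 0; i < ids.length; i++) {
            buf.putInt(ids[i]).putInt(ages[i]);
        }
        buf.flip(); // limit now marks the end of the written data
        return buf;
    }

    // Operate directly on the binary data: read only the age field of each row,
    // without materializing any per-record object.
    static long sumAges(ByteBuffer rows) {
        long sum = 0;
        int count = rows.limit() / ROW_SIZE;
        for (int i = 0; i < count; i++) {
            sum += rows.getInt(i * ROW_SIZE + AGE_OFFSET);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer rows = pack(new int[] {1, 2, 3}, new int[] {29, 30, 19});
        System.out.println("bytes used: " + rows.limit());
        System.out.println("sum of ages: " + sumAges(rows));
    }
}
```

The aggregation touches a flat byte range sequentially and creates no garbage, which is the effect the slide attributes to binary processing.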
6
Catalyst optimizer
? ??? ??? ????? ?? ??? ?? ??? extensible optimizer
? extensible design
? ??? optimization techniques? feature?? ??? ??
? ?? ???? optimizer? ???? ???? ??
? Catalyst? ?? (In Spark SQL) ( ??? ??? paper ??)
? Tree ??? ???? optimization
? ?? 4??? ??
Image source: Catalyst Optimizer - Databricks
Figures: tree for x+(1+2); Catalyst's phases
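The x+(1+2) tree on this slide is the example used in the Databricks Catalyst article, where a constant-folding rule rewrites it to x+3. Below is a minimal plain-Java sketch of that idea. Catalyst itself is written in Scala, and the classes `ConstantFolding`, `Expr`, `Literal`, `Attribute`, and `Add` here are hypothetical stand-ins rather than Catalyst's API.

```java
// Plain-Java sketch of a tree plus one rewrite rule; not Catalyst code.
class ConstantFolding {
    // Rewrite the children first, then apply the rule at this node:
    // Add(Literal(a), Literal(b)) becomes Literal(a + b).
    static Expr fold(Expr e) {
        if (!(e instanceof Add)) {
            return e; // literals and attributes are leaves
        }
        Add add = (Add) e;
        Expr l = fold(add.left);
        Expr r = fold(add.right);
        if (l instanceof Literal && r instanceof Literal) {
            return new Literal(((Literal) l).value + ((Literal) r).value);
        }
        return new Add(l, r);
    }

    public static void main(String[] args) {
        Expr tree = new Add(new Attribute("x"), new Add(new Literal(1), new Literal(2)));
        System.out.println(tree + "  =>  " + fold(tree)); // prints "(x + (1 + 2))  =>  (x + 3)"
    }
}

abstract class Expr {}

final class Literal extends Expr {
    final int value;
    Literal(int value) { this.value = value; }
    @Override public String toString() { return Integer.toString(value); }
}

final class Attribute extends Expr {
    final String name;
    Attribute(String name) { this.name = name; }
    @Override public String toString() { return name; }
}

final class Add extends Expr {
    final Expr left;
    final Expr right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override public String toString() { return "(" + left + " + " + right + ")"; }
}
```

Each rule only has to describe one local pattern; applying it over the whole tree is handled by the traversal, which is what keeps an optimizer of this style extensible.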
7
RDD, DataFrame
DataFrames / SQL
Structured Binary Data (Tungsten)
? High level relational operation ?? ??
? Catalyst optimization ?? ??
? Lower memory pressure
? Memory accounting (avoid OOMs)
? Faster sorting / hashing / serialization
RDDs
Collections of Native JVM Objects
? ?? ?? ?? data type ??? ??
? Compile-time type-safety ??
? ??? ????? ??
? ? ?? ?? ??? ?? ???..
? ?? ?? cost
? ??? ?? boilerplate(?? ??, ??) ?? ??
? ?? ??? ???? API? ??? ? ????
? Catalyst optimizer & Tungsten execution engine? ??? ??? ? ??? ?
? Domain object? type? ??? ? ?? ?? ??? ? ??? ?
?? ??? & ???
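The compile-time type-safety point can be seen without Spark. In this hedged plain-Java sketch, `Person` plays the role of a domain object held in an RDD, and a `Map<String, Object>` stands in for an untyped row accessed by column name. Both, along with the helper `column`, are illustrative assumptions, not Spark classes.

```java
import java.util.HashMap;
import java.util.Map;

class TypeSafetyDemo {
    // Typed domain object: field access is checked by the compiler.
    static final class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // Untyped row keyed by column name (hypothetical stand-in for a row lookup).
    static Object column(Map<String, Object> row, String name) {
        if (!row.containsKey(name)) {
            throw new IllegalArgumentException("no such column: " + name);
        }
        return row.get(name);
    }

    public static void main(String[] args) {
        Person p = new Person("Andy", 30);
        int typed = p.age + 1; // a misspelled field such as p.agee would not compile

        Map<String, Object> row = new HashMap<>();
        row.put("name", "Andy");
        row.put("age", 30);
        int untyped = ((Integer) column(row, "age")) + 1; // cast is checked only at run time

        System.out.println(typed + " " + untyped);
        try {
            column(row, "agee"); // the typo compiles and only fails when executed
        } catch (IllegalArgumentException e) {
            System.out.println("runtime error: " + e.getMessage());
        }
    }
}
```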
8
Dataset
? RDD? DataFrame? ??? ?? ?? interface API
? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
? ???
? Fast
? Typesafe
? Support for a variety of object models
? Java Compatible
? Interoperates with DataFrames
Dataset
Structured Binary Data (Tungsten)
? High level relational operation ?? ??
? Catalyst optimization ?? ??
? Lower memory pressure
? Memory accounting (avoid OOMs)
? Faster sorting / hashing / serialization
? ?? ?? ?? data type ??? ??
? Compile-time type-safety ??
? ??? ????? ??
Image source: technicaltidbit.blogspot.kr/
9
Dataset
? Encoder
? Dataset? ??? ??? Structured Binary data? ??
? JVM object? RDD? ??, DataFrame?? ??? ??
? Processing ? ??? ???? serialization? ??
? RDD/DataFrame type? data? Dataset?? ???? ????
Object? ??? ?? ??? Encoder? ??(?? ????? ???? ??)
? ?? ???? ???? Java, Kryo serialization? ?? ?? ???
Figure: Data Serialization ?? ??. Image source: Introducing Spark Datasets - Databricks
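As a rough illustration of why an encoder that knows an object's fields can beat generic serialization, the plain-Java sketch below compares a hand-written binary layout for a `Person` with standard Java serialization. The layout and the names `EncoderSketch`, `encode`, `decode`, and `ageOf` are assumptions for this example; Spark's encoders generate their own code and use a different format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class EncoderSketch {
    static class Person implements Serializable {
        String name;
        int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // Hypothetical layout: [int age][int nameLength][UTF-8 name bytes].
    static byte[] encode(Person p) {
        byte[] name = p.name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + name.length)
                .putInt(p.age).putInt(name.length).put(name).array();
    }

    static Person decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int age = buf.getInt();
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        return new Person(new String(name, StandardCharsets.UTF_8), age);
    }

    // Read one field straight from the encoded form, without rebuilding the object.
    static int ageOf(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt(0);
    }

    // Generic Java serialization, used here only for a size comparison.
    static byte[] javaSerialize(Person p) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Person andy = new Person("Andy", 30);
        byte[] encoded = encode(andy);
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("java-serialized bytes: " + javaSerialize(andy).length);
        System.out.println("age read in place: " + ageOf(encoded));
    }
}
```

The generic form has to carry class metadata in the stream, while the field-aware form stores only the values, and a field can be read at a known offset without deserializing the whole object.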
10
Dataset
? structured/semi-structured Data ?? => Dataset??
?? RDD DataFrame Dataset
?? ??? ?? ?? ??
??? ?? ?? ?? ??? ???
Type-safety ?? ???? ?? ??
?? ??? ??? ??? ???
RDD-Dataset?
WordCount ???
?? ?? ??
RDD-Dataset?
?? ? memory
??? ??
Image source: Introducing Spark Datasets - Databricks
11
Spark SQL application (in Java)
? Linking
? Pom.xml? ?? ?? ??
12
Spark SQL application (in Java)
? sample
examples/src/main/resources/people.json
examples/src/main/resources/people.txt
13
Spark SQL application (in Java)
- (DataFrame example 1) JSON file
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class DataFrameJsonExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(ctx);

        // Load the JSON file; the schema is inferred from the records.
        DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");
        df.show();                                                // 1: print the rows
        df.printSchema();                                         // 2: print the inferred schema
        df.select("name").show();                                 // 3: project the name column
        df.select(df.col("name"), df.col("age").plus(1)).show();  // 4: name and age + 1
        df.filter(df.col("age").gt(21)).show();                   // 5: rows with age > 21
        df.groupBy("age").count().show();                         // 6: row count per age

        // Register the DataFrame as a temporary table and query it with SQL.
        df.registerTempTable("people");
        DataFrame results = sqlContext.sql("SELECT name FROM people");
        List<String> names = results.javaRDD().map(new Function<Row, String>() {
            public String call(Row row) { return "Name: " + row.getString(0); }
        }).collect();
        for (String name : names) {                               // 7: print each mapped name
            System.out.println(name);
        }
        ctx.stop();
    }
}
14
Spark SQL application (in Java)
- (DataFrame example 2) Text file: specifying the schema
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DataFrameSchemaExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(ctx);
        JavaRDD<String> people = ctx.textFile("examples/src/main/resources/people.txt");

        // Build the schema from a string of column names; every column is a nullable string.
        String schemaString = "name age";
        List<StructField> fields = new ArrayList<StructField>();
        for (String fieldName : schemaString.split(" ")) {
            fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        // Convert each "name, age" line into a Row that matches the schema.
        JavaRDD<Row> rowRDD = people.map(new Function<String, Row>() {
            public Row call(String record) throws Exception {
                String[] parts = record.split(",");
                return RowFactory.create(parts[0], parts[1].trim());
            }
        });

        DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);
        peopleDataFrame.show();        // 1: print the rows
        peopleDataFrame.printSchema(); // 2: print the declared schema
        ctx.stop();
    }
}
15
Spark SQL application (in Java)
- (DataFrame example 3) Text file: inferring the schema (JavaBean)
import java.io.Serializable;

// JavaBean whose getters and setters define the columns "name" and "age".
public class Person implements Serializable {
    private String name;
    private int age;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}
16
Spark SQL application (in Java)
- (DataFrame example 3) Text file: inferring the schema (JavaBean)
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameBeanExample {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(ctx);

        // Parse each "name, age" line into a Person bean.
        JavaRDD<Person> people = ctx.textFile("examples/src/main/resources/people.txt").map(
            new Function<String, Person>() {
                public Person call(String line) throws Exception {
                    String[] parts = line.split(",");
                    Person person = new Person();
                    person.setName(parts[0]);
                    person.setAge(Integer.parseInt(parts[1].trim()));
                    return person;
                }
            });

        // The schema is inferred from the Person bean's properties by reflection.
        DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
        schemaPeople.registerTempTable("people");
        DataFrame teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19");
        teenagers.show();        // 1: print the matching rows
        teenagers.printSchema(); // 2: print the inferred schema
        ctx.stop();
    }
}
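Spark's documentation describes the JavaBean path as reflection-based: the bean's `BeanInfo` defines the table schema. The sketch below shows that mechanism in isolation using only `java.beans`. `BeanSchemaSketch` and `inferSchema` are hypothetical names, the nested `Person` simply mirrors the bean from this example, and the mapping from Java types to Spark SQL types is left out.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.LinkedHashMap;
import java.util.Map;

class BeanSchemaSketch {
    // Same shape as the Person bean used in the example.
    public static class Person implements java.io.Serializable {
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Collect (property name -> Java type name) for every property that has a getter.
    static Map<String, String> inferSchema(Class<?> beanClass) {
        Map<String, String> schema = new LinkedHashMap<>();
        try {
            // Stopping at Object.class leaves out the implicit "class" property.
            BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
            for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
                if (pd.getReadMethod() != null) {
                    schema.put(pd.getName(), pd.getPropertyType().getSimpleName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(inferSchema(Person.class));
    }
}
```

This is also why the bean needs conventional getters: a field without one is invisible to the introspector and would not become a column in this sketch.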
17
Spark SQL application (in Java)
- (Dataset example)
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SQLContext;

SparkConf sparkConf = new SparkConf().setAppName("dataset");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(ctx);
...
// Convert the untyped DataFrame into a typed Dataset<Person> using a bean encoder.
DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
Dataset<Person> schools = schemaPeople.as(Encoders.bean(Person.class));
// Map each Person to a String; the result type needs its own encoder.
Dataset<String> strings = schools.map(new BuildString(), Encoders.STRING());
//Dataset<String> strings = schools.map(p-> p.getName()+" is "+ p.getAge()+" years old.", Encoders.STRING());
List<String> result = strings.collectAsList();
for (String line : result) { // 1: print each generated sentence
    System.out.println(line);
}
ctx.stop();

// Named function class used by the map call above.
class BuildString implements MapFunction<Person, String> {
    public String call(Person p) throws Exception {
        return p.getName() + " is " + p.getAge() + " years old.";
    }
}
18
Spark SQL application (in Java)
? (Dataset example) Encoder
? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoder.html
? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoders.html
1. Primitive type ???
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());
2. tuple type(K,V pair) ???
Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
List<Tuple2<Integer, String>> data2 = Arrays.asList(new Tuple2<Integer, String>(1, "a"));
Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);
3. Java Beans? ??? reference type ???
Encoders.bean(MyClass.class);
19
References
- Dataset
  - https://issues.apache.org/jira/browse/SPARK-9999
  - http://spark.apache.org/docs/latest/sql-programming-guide.html
  - http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
  - http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoder.html
  - https://databricks.com/blog/2015/11/20/announcing-spark-1-6-preview-in-databricks.html
  - https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
  - http://technicaltidbit.blogspot.kr/2015/10/spark-16-datasets-best-of-rdds-and.html
  - http://www.slideshare.net/databricks/apache-spark-16-presented-by-databricks-cofounder-patrick-wendell
- Tungsten
  - https://issues.apache.org/jira/browse/SPARK-7075
  - https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
- Catalyst
  - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
  - Michael Armbrust et al. Spark SQL: Relational Data Processing in Spark. In SIGMOD, 2015.

More Related Content

What's hot (20)

?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)
?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)
?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)
Suhyun Park
?
Apache Spark
Apache SparkApache Spark
Apache Spark
ssuser09ca0c1
?
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Databricks
?
[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)
[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)
[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)
Hyojun Jeon
?
[135] ???? ????????? ??????????? ??? ?? ???????? ????????
[135] ???? ????????? ??????????? ??? ?? ???????? ????????[135] ???? ????????? ??????????? ??? ?? ???????? ????????
[135] ???? ????????? ??????????? ??? ?? ???????? ????????
NAVER D2
?
?? ????? ???? ? ???? ??? ??? ??
?? ????? ???? ? ???? ??? ??? ???? ????? ???? ? ???? ??? ??? ??
?? ????? ???? ? ???? ??? ??? ??
Wonha Ryu
?
?? ???? ??
?? ???? ???? ???? ??
?? ???? ??
?? ?
?
[Play.node] node.js ? ??? ??? ???(+??) ???
[Play.node] node.js ? ??? ??? ???(+??) ???[Play.node] node.js ? ??? ??? ???(+??) ???
[Play.node] node.js ? ??? ??? ???(+??) ???
Dan Kang (???)
?
NoSQL ??? MMORPG ????
NoSQL ??? MMORPG ????NoSQL ??? MMORPG ????
NoSQL ??? MMORPG ????
Hoyoung Choi
?
ɌߤƤäȤޤȤʴѱٱdzdz.⤤Ǥ
ɌߤƤäȤޤȤʴѱٱdzdz.⤤ǤɌߤƤäȤޤȤʴѱٱdzdz.⤤Ǥ
ɌߤƤäȤޤȤʴѱٱdzdz.⤤Ǥ
johgus johgus
?
Apache spark ?? ? ??
Apache spark ?? ? ??Apache spark ?? ? ??
Apache spark ?? ? ??
?? ?
?
jBPM6 Updates
jBPM6 UpdatesjBPM6 Updates
jBPM6 Updates
Kris Verlaenen
?
Linux memory
Linux memoryLinux memory
Linux memory
ericrain911
?
2018 ?????? for ???
2018 ?????? for ???2018 ?????? for ???
2018 ?????? for ???
Yu Yongwoo
?
???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019
???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019
???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019
devCAT Studio, NEXON
?
[NDC2016] TERA ??? Modern C++ ???
[NDC2016] TERA ??? Modern C++ ???[NDC2016] TERA ??? Modern C++ ???
[NDC2016] TERA ??? Modern C++ ???
Sang Heon Lee
?
Spark & Zeppelin? ??? ???? ?? ???
Spark & Zeppelin? ??? ???? ?? ???Spark & Zeppelin? ??? ???? ?? ???
Spark & Zeppelin? ??? ???? ?? ???
Taejun Kim
?
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
Ilya Ganelin
?
Cgroups in android
Cgroups in androidCgroups in android
Cgroups in android
ramalinga prasad tadepalli
?
[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????
[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????
[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????
Atlassian ????
?
?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)
?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)
?? ??? ? ???? ???? ?? ?? (????? KUCC, 2022? 4?)
Suhyun Park
?
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Databricks
?
[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)
[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)
[NDC18] ??? ? ???? ??? ????? ???: ?? ??? ?? ?? ?? (2?)
Hyojun Jeon
?
[135] ???? ????????? ??????????? ??? ?? ???????? ????????
[135] ???? ????????? ??????????? ??? ?? ???????? ????????[135] ???? ????????? ??????????? ??? ?? ???????? ????????
[135] ???? ????????? ??????????? ??? ?? ???????? ????????
NAVER D2
?
?? ????? ???? ? ???? ??? ??? ??
?? ????? ???? ? ???? ??? ??? ???? ????? ???? ? ???? ??? ??? ??
?? ????? ???? ? ???? ??? ??? ??
Wonha Ryu
?
?? ???? ??
?? ???? ???? ???? ??
?? ???? ??
?? ?
?
[Play.node] node.js ? ??? ??? ???(+??) ???
[Play.node] node.js ? ??? ??? ???(+??) ???[Play.node] node.js ? ??? ??? ???(+??) ???
[Play.node] node.js ? ??? ??? ???(+??) ???
Dan Kang (???)
?
NoSQL ??? MMORPG ????
NoSQL ??? MMORPG ????NoSQL ??? MMORPG ????
NoSQL ??? MMORPG ????
Hoyoung Choi
?
ɌߤƤäȤޤȤʴѱٱdzdz.⤤Ǥ
ɌߤƤäȤޤȤʴѱٱdzdz.⤤ǤɌߤƤäȤޤȤʴѱٱdzdz.⤤Ǥ
ɌߤƤäȤޤȤʴѱٱdzdz.⤤Ǥ
johgus johgus
?
Apache spark ?? ? ??
Apache spark ?? ? ??Apache spark ?? ? ??
Apache spark ?? ? ??
?? ?
?
2018 ?????? for ???
2018 ?????? for ???2018 ?????? for ???
2018 ?????? for ???
Yu Yongwoo
?
???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019
???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019
???, ?? ?? ?????? ?? ? - ?? ?? ???? ??? ??, NDC2019
devCAT Studio, NEXON
?
[NDC2016] TERA ??? Modern C++ ???
[NDC2016] TERA ??? Modern C++ ???[NDC2016] TERA ??? Modern C++ ???
[NDC2016] TERA ??? Modern C++ ???
Sang Heon Lee
?
Spark & Zeppelin? ??? ???? ?? ???
Spark & Zeppelin? ??? ???? ?? ???Spark & Zeppelin? ??? ???? ?? ???
Spark & Zeppelin? ??? ???? ?? ???
Taejun Kim
?
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
Ilya Ganelin
?
[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????
[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????
[AIS 2018] [Team Tools_Basic] Confluence? ??? ??? - ?????
Atlassian ????
?

Viewers also liked (20)

2.apache spark ??
2.apache spark ??2.apache spark ??
2.apache spark ??
?? ?
?
Why FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave GruberWhy FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave Gruber
JAX London
?
Spark ?? 2?
Spark ?? 2?Spark ?? 2?
Spark ?? 2?
Jinho Yoo
?
Spark overview ???(SK C&C)_??? ??? ??_20141106
Spark overview ???(SK C&C)_??? ??? ??_20141106Spark overview ???(SK C&C)_??? ??? ??_20141106
Spark overview ???(SK C&C)_??? ??? ??_20141106
SangHoon Lee
?
SPARK SQL
SPARK SQLSPARK SQL
SPARK SQL
Juhui Park
?
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
?
Spark machine learning & deep learning
Spark machine learning & deep learningSpark machine learning & deep learning
Spark machine learning & deep learning
hoondong kim
?
Apache Zeppelin???? ?????? ??????????
Apache Zeppelin???? ?????? ??????????Apache Zeppelin???? ?????? ??????????
Apache Zeppelin???? ?????? ??????????
SangWoo Kim
?
[264] large scale deep-learning_on_spark
[264] large scale deep-learning_on_spark[264] large scale deep-learning_on_spark
[264] large scale deep-learning_on_spark
NAVER D2
?
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Spark Summit
?
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
?
Jane Recommendation Engines
Jane Recommendation EnginesJane Recommendation Engines
Jane Recommendation Engines
Adam Rogers
?
[NoSQL] 2?. ??? ??? ??
[NoSQL] 2?. ??? ??? ??[NoSQL] 2?. ??? ??? ??
[NoSQL] 2?. ??? ??? ??
kidoki
?
9?. ?? ??????
9?. ?? ??????9?. ?? ??????
9?. ?? ??????
kidoki
?
???? Iso21500vs pmbok-???
????   Iso21500vs pmbok-???????   Iso21500vs pmbok-???
???? Iso21500vs pmbok-???
Byeong Ju Bae
?
[NDC 2011] ?? ???? ?? ?????? ??
[NDC 2011] ?? ???? ?? ?????? ??[NDC 2011] ?? ???? ?? ?????? ??
[NDC 2011] ?? ???? ?? ?????? ??
Hoon Park
?
20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering)
20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering) 20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering)
20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering)
Tae Young Lee
?
???? ??? ?? ??? 2 ????? : ??? ??? ???? ??????
???? ??? ?? ??? 2 ????? : ??? ??? ???? ?????????? ??? ?? ??? 2 ????? : ??? ??? ???? ??????
???? ??? ?? ??? 2 ????? : ??? ??? ???? ??????
????
?
Machine learning & security. Detect atypical behaviour in logs
Machine learning & security. Detect atypical behaviour in logsMachine learning & security. Detect atypical behaviour in logs
Machine learning & security. Detect atypical behaviour in logs
Alexander Melnychuk
?
Spark? Hadoop, ??? ?? (???)
Spark? Hadoop, ??? ?? (???)Spark? Hadoop, ??? ?? (???)
Spark? Hadoop, ??? ?? (???)
Teddy Choi
?
2.apache spark ??
2.apache spark ??2.apache spark ??
2.apache spark ??
?? ?
?
Why FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave GruberWhy FLOSS is a Java developer's best friend: Dave Gruber
Why FLOSS is a Java developer's best friend: Dave Gruber
JAX London
?
Spark overview ???(SK C&C)_??? ??? ??_20141106
Spark overview ???(SK C&C)_??? ??? ??_20141106Spark overview ???(SK C&C)_??? ??? ??_20141106
Spark overview ???(SK C&C)_??? ??? ??_20141106
SangHoon Lee
?
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
?
Spark machine learning & deep learning
Spark machine learning & deep learningSpark machine learning & deep learning
Spark machine learning & deep learning
hoondong kim
?
Apache Zeppelin???? ?????? ??????????
Apache Zeppelin???? ?????? ??????????Apache Zeppelin???? ?????? ??????????
Apache Zeppelin???? ?????? ??????????
SangWoo Kim
?
[264] large scale deep-learning_on_spark
[264] large scale deep-learning_on_spark[264] large scale deep-learning_on_spark
[264] large scale deep-learning_on_spark
NAVER D2
?
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Spark Summit
?
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
?
Jane Recommendation Engines
Jane Recommendation EnginesJane Recommendation Engines
Jane Recommendation Engines
Adam Rogers
?
[NoSQL] 2?. ??? ??? ??
[NoSQL] 2?. ??? ??? ??[NoSQL] 2?. ??? ??? ??
[NoSQL] 2?. ??? ??? ??
kidoki
?
9?. ?? ??????
9?. ?? ??????9?. ?? ??????
9?. ?? ??????
kidoki
?
???? Iso21500vs pmbok-???
????   Iso21500vs pmbok-???????   Iso21500vs pmbok-???
???? Iso21500vs pmbok-???
Byeong Ju Bae
?
[NDC 2011] ?? ???? ?? ?????? ??
[NDC 2011] ?? ???? ?? ?????? ??[NDC 2011] ?? ???? ?? ?????? ??
[NDC 2011] ?? ???? ?? ?????? ??
Hoon Park
?
20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering)
20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering) 20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering)
20141214 ???????? - ??? ? ??? ?? (Similarity&Clustering)
Tae Young Lee
?
???? ??? ?? ??? 2 ????? : ??? ??? ???? ??????
???? ??? ?? ??? 2 ????? : ??? ??? ???? ?????????? ??? ?? ??? 2 ????? : ??? ??? ???? ??????
???? ??? ?? ??? 2 ????? : ??? ??? ???? ??????
????
?
Machine learning & security. Detect atypical behaviour in logs
Machine learning & security. Detect atypical behaviour in logsMachine learning & security. Detect atypical behaviour in logs
Machine learning & security. Detect atypical behaviour in logs
Alexander Melnychuk
?
Spark? Hadoop, ??? ?? (???)
Spark? Hadoop, ??? ?? (???)Spark? Hadoop, ??? ?? (???)
Spark? Hadoop, ??? ?? (???)
Teddy Choi
?

Similar to Spark sql (20)

Cloudera session seoul - Spark bootcamp
Cloudera session seoul - Spark bootcampCloudera session seoul - Spark bootcamp
Cloudera session seoul - Spark bootcamp
Sang-bae Lim
?
Introduction to Apache Tajo
Introduction to Apache TajoIntroduction to Apache Tajo
Introduction to Apache Tajo
Gruter
?
[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4
[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4
[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4
Seok-joon Yun
?
Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)
Gruter
?
????? ??: ???? ??
????? ??: ???? ??????? ??: ???? ??
????? ??: ???? ??
Leonardo YongUk Kim
?
7?? ??? ?? ??????
7?? ??? ??  ??????7?? ??? ??  ??????
7?? ??? ?? ??????
Sunggon Song
?
?????? ????????? ???? ???? ????? ???????
?????? ????????? ???? ???? ????? ????????????? ????????? ???? ???? ????? ???????
?????? ????????? ???? ???? ????? ???????
Yeonhee Kim
?
Flamingo (FEA) Spark Designer
Flamingo (FEA) Spark DesignerFlamingo (FEA) Spark Designer
Flamingo (FEA) Spark Designer
BYOUNG GON KIM
?
Spark_Overview_qna
Spark_Overview_qnaSpark_Overview_qna
Spark_Overview_qna
?? ?
?
Learning spark ch1-2
Learning spark ch1-2Learning spark ch1-2
Learning spark ch1-2
HyeonSeok Choi
?
spark database Service
spark database Servicespark database Service
spark database Service
?? ?
?
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...
Kyuhyun Byun
?
Scala, Spring-Boot, JPA? ?????? ??? ??
Scala, Spring-Boot, JPA? ?????? ??? ??Scala, Spring-Boot, JPA? ?????? ??? ??
Scala, Spring-Boot, JPA? ?????? ??? ??
Javajigi Jaesung
?
Ndc2011 ?? ???_??_??????_????_??_?_??_???
Ndc2011 ?? ???_??_??????_????_??_?_??_???Ndc2011 ?? ???_??_??????_????_??_?_??_???
Ndc2011 ?? ???_??_??????_????_??_?_??_???
cranbe95
?
Jstl_GETCHA_HANJUNG
Jstl_GETCHA_HANJUNGJstl_GETCHA_HANJUNG
Jstl_GETCHA_HANJUNG
Jung Han
?
Spark performance tuning
Spark performance tuningSpark performance tuning
Spark performance tuning
haiteam
?
???????????(The way to setting the Spring framework for web.)
???????????(The way to setting the Spring framework for web.)???????????(The way to setting the Spring framework for web.)
???????????(The way to setting the Spring framework for web.)
EunChul Shin
?
MySQL_MariaDB-????-202201.pptx
MySQL_MariaDB-????-202201.pptxMySQL_MariaDB-????-202201.pptx
MySQL_MariaDB-????-202201.pptx
NeoClova
?
??? ??? ?? ?? ???? ???? ?? ???? CloudBread ????
??? ??? ?? ?? ???? ???? ?? ???? CloudBread ??????? ??? ?? ?? ???? ???? ?? ???? CloudBread ????
??? ??? ?? ?? ???? ???? ?? ???? CloudBread ????
Dae Kim
?
???? ???? Eclipse??
???? ???? Eclipse?????? ???? Eclipse??
???? ???? Eclipse??
cho hyun jong
?
Cloudera session seoul - Spark bootcamp
Cloudera session seoul - Spark bootcampCloudera session seoul - Spark bootcamp
Cloudera session seoul - Spark bootcamp
Sang-bae Lim
?
Introduction to Apache Tajo
Introduction to Apache TajoIntroduction to Apache Tajo
Introduction to Apache Tajo
Gruter
?
[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4
[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4
[2015 07-06-???] Oracle ?? ??? ? ?? ??? 4
Seok-joon Yun
?
Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)Big data analysis with R and Apache Tajo (in Korean)
Big data analysis with R and Apache Tajo (in Korean)
Gruter
?
?????? ????????? ???? ???? ????? ???????
?????? ????????? ???? ???? ????? ????????????? ????????? ???? ???? ????? ???????
?????? ????????? ???? ???? ????? ???????
Yeonhee Kim
?
Flamingo (FEA) Spark Designer
Flamingo (FEA) Spark DesignerFlamingo (FEA) Spark Designer
Flamingo (FEA) Spark Designer
BYOUNG GON KIM
?
Spark_Overview_qna
Spark_Overview_qnaSpark_Overview_qna
Spark_Overview_qna
?? ?
?
spark database Service
spark database Servicespark database Service
spark database Service
?? ?
?
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...
Golang Project Guide from A to Z: From Feature Development to Enterprise Appl...
Kyuhyun Byun
?
Scala, Spring-Boot, JPA? ?????? ??? ??
Scala, Spring-Boot, JPA? ?????? ??? ??Scala, Spring-Boot, JPA? ?????? ??? ??
Scala, Spring-Boot, JPA? ?????? ??? ??
Javajigi Jaesung
?
Ndc2011 ?? ???_??_??????_????_??_?_??_???
Ndc2011 ?? ???_??_??????_????_??_?_??_???Ndc2011 ?? ???_??_??????_????_??_?_??_???
Ndc2011 ?? ???_??_??????_????_??_?_??_???
cranbe95
?
Jstl_GETCHA_HANJUNG
Jstl_GETCHA_HANJUNGJstl_GETCHA_HANJUNG
Jstl_GETCHA_HANJUNG
Jung Han
?
Spark performance tuning
Spark performance tuningSpark performance tuning
Spark performance tuning
haiteam
?
???????????(The way to setting the Spring framework for web.)
???????????(The way to setting the Spring framework for web.)???????????(The way to setting the Spring framework for web.)
???????????(The way to setting the Spring framework for web.)
EunChul Shin
?
MySQL_MariaDB-????-202201.pptx
MySQL_MariaDB-????-202201.pptxMySQL_MariaDB-????-202201.pptx
MySQL_MariaDB-????-202201.pptx
NeoClova
?
??? ??? ?? ?? ???? ???? ?? ???? CloudBread ????
??? ??? ?? ?? ???? ???? ?? ???? CloudBread ??????? ??? ?? ?? ???? ???? ?? ???? CloudBread ????
??? ??? ?? ?? ???? ???? ?? ???? CloudBread ????
Dae Kim
?

Spark sql

  • 2. 2 ?? ? Spark SQL ?? ? Tungsten execution engine ? Catalyst optimizer ? RDD, DataFrame ? Dataset ? Spark SQL application (in Java) ? Linking ? example ? ?? ??
  • 3. 3 Spark SQL ? ??/??? ???(structured/semi-structured Data) ??? ??? Spark Library ? Tungsten execution engine & Catalyst optimizer ?? ? ??? interface ?? ? SQL, HiveSQL queries ? Dataframe API ? Scala, Java, Python, and R?? ?? ?? ? Spark 1.3 ? Dataset API ? Scala, Java ?? ?? ?? ? Spark 1.6 ? ??? Input source? ?? ? RDD & ?? ??? ? JSON ??? ? ? Parquet file ? Hive Table ? ODBC/JCBC ???? ??
  • 4. 4 Tungsten execution engine ? ???? bottleneck?? ? I/O? network bandwidth? ?? ? High bandwidth SSD & striped HDD? ?? ? 10Gbps network? ?? ? CPU ? memory?? bottleneck ??? ?? ? ???? processing workload? ?? ? Disk I/O? ????? ?? input data pruning workload ? Shuffle? ?? serialization? hashing? ?? key bottleneck ? CPU? memory? ??? ??? ? ????? ??? ??? ??? ??? ? ?? System Engine? ??! ? Project Tungsten ? Spark 1.4?? DataFrame? ?? ? spark 1.6?? Dataset?? ?? ??: 1. Project Tungsten C databrick 2. https://issues.apache.org/jira/browse/SPARK-7075
  • 5. 5 Tungsten execution engine C three Goal ? Memory Management and Binary Processing ? JVM object ??? garbage collection? overhead? ? =>????? data? ??? ?? Java objects ??? binary format?? ?? ? ???? ?? ???? ??? ??? ?? => denser in-memory data format? ???? ??? ???? ??? ? ? ?? memory accounting (size of bytes) ?? ??(??? Heuristics ??) ? ?? ???? ??? domain semantics? ??? ??? data processing? ???? ? =>binary format? in memory data? ???? data type? ???? operator? ?? (serialization/deserialization ?? data processing) ? Cache-aware Computation ? sorting and hashing for aggregations, joins, and shuffle? ??? ??? ?? =>memory hierarchy? ???? algorithm? data struncture ? Code Generation ? expression evaluation, DataFrame/SQL operators, serializer? ?? ??? ?? =>modern compilers and CPUs? ??? ??? ??? ? ?? code generation
  • 6. 6 Catalyst optimizer ? ??? ??? ????? ?? ??? ?? ??? extensible optimizer ? extensible design ? ??? optimization techniques? feature?? ??? ?? ? ?? ???? optimizer? ???? ???? ?? ? Catalyst? ?? (In Spark SQL) ( ??? ??? paper ??) ? Tree ??? ???? optimization ? ?? 4??? ?? ?? ??: Catalyst Optimizer - databrick x+(1+2) ? tree ?? Catalyst? phase
  • 7. 7 RDD, DataFrame DataFrames / SQL Structured Binary Data (Tungsten) ? High level relational operation ?? ?? ? Catalyst optimization ?? ?? ? Lower memory pressure ? Memory accounting (avoid OOMs) ? Faster sorting / hashing / serialization RDDs Collections of Native JVM Objects ? ?? ?? ?? data type ??? ?? ? Compile-time type-safety ?? ? ??? ????? ?? ? ? ?? ?? ??? ?? ???.. ? ?? ?? cost ? ??? ?? boilerplate(?? ??, ??) ?? ?? ? ?? ??? ???? API? ??? ? ???? ? Catalyst optimizer & Tungsten execution engine? ??? ??? ? ??? ? ? Domain object? type? ??? ? ?? ?? ??? ? ??? ? ?? ??? & ???
  • 8. Dataset
      • A new interface API that combines the advantages of RDD and DataFrame
          • http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
      • Characteristics
          • Fast
          • Typesafe
          • Support for a variety of object models
          • Java compatible
          • Interoperates with DataFrames
      • Dataset: structured binary data (Tungsten)
          • High-level relational operations
          • Catalyst optimization
          • Lower memory pressure
          • Memory accounting (avoids OOMs)
          • Faster sorting / hashing / serialization
          • Any user-chosen data type
          • Compile-time type-safety
          • Object-oriented programming
      Source: technicaltidbit.blogspot.kr/
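The "compile-time type-safety" item can be made concrete without Spark. The sketch below is an analogy in plain Java, assuming an untyped row modeled as a `Map<String, Object>` and a typed domain class `Person`; the names `TypeSafetySketch`, `isAdultUntyped`, and `isAdultTyped` are hypothetical. With untyped access, a wrong column type only surfaces when the code runs; with a typed object, the compiler rejects it.

```java
import java.util.HashMap;
import java.util.Map;

final class TypeSafetySketch {
    private TypeSafetySketch() {}

    /** Typed domain object: field types are known to the compiler. */
    static final class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    /** Untyped row access: column name and type are only checked at run time. */
    static boolean isAdultUntyped(Map<String, Object> row) {
        return ((Integer) row.get("age")) >= 18; // ClassCastException if "age" is not an Integer
    }

    /** Typed access: a wrong field name or type would not compile. */
    static boolean isAdultTyped(Person p) {
        return p.age >= 18;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("name", "Andy");
        row.put("age", "30"); // mistake: a String where an Integer is expected

        try {
            isAdultUntyped(row);
        } catch (ClassCastException e) {
            System.out.println("untyped access failed at run time");
        }

        // new Person("Andy", "30") would be rejected by the compiler.
        System.out.println("typed check: " + isAdultTyped(new Person("Andy", 30)));
    }
}
```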
  • 9. Dataset: Encoder
      • A Dataset stores its elements as structured binary data
          • This differs from an RDD (JVM objects) and from a DataFrame
      • Serialization is specialized for processing
      • Converting RDD/DataFrame data to a Dataset requires an Encoder for the target object type
      • Data serialization is faster and more compact than Java or Kryo serialization
      Source: Introducing Spark Datasets (Databricks blog)
      Figure: data serialization comparison
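Why a schema-aware encoder can beat generic serialization is easier to see in a small example. The following plain-Java sketch is not Spark's `Encoder` implementation; `EncoderSketch`, its `encode`/`decode`/`ageOf` methods, and the layout `[int age][name bytes]` are assumptions for illustration. Because the encoder knows the fields, it writes only their values, and a single field can be read from the encoded bytes without rebuilding the object. Standard Java serialization of the same object is measured for comparison.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EncoderSketch {
    private EncoderSketch() {}

    static final class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    /** Hypothetical schema-aware encoding: [int age][name bytes (UTF-8)]. */
    static byte[] encode(Person p) {
        byte[] name = p.name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + name.length).putInt(p.age).put(name).array();
    }

    /** Rebuilds the object from its encoded form. */
    static Person decode(byte[] bytes) {
        int age = ByteBuffer.wrap(bytes).getInt();
        String name = new String(bytes, Integer.BYTES, bytes.length - Integer.BYTES,
                StandardCharsets.UTF_8);
        return new Person(name, age);
    }

    /** Reads one field straight from the encoded bytes, without rebuilding the object. */
    static int ageOf(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    /** Size of the same object under standard Java serialization, for comparison. */
    static int javaSerializedSize(Person p) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    public static void main(String[] args) {
        Person andy = new Person("Andy", 30);
        byte[] bytes = encode(andy);
        System.out.println("encoded size: " + bytes.length + " bytes");
        System.out.println("Java serialization size: " + javaSerializedSize(andy) + " bytes");
        System.out.println("age read from bytes: " + ageOf(bytes));
        System.out.println("decoded name: " + decode(bytes).name);
    }
}
```

A filter on `age` could call `ageOf` on every record and never allocate a `Person`, which is the kind of operation an encoder-backed format makes possible.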
  • 10. Dataset
      • For structured/semi-structured data => use a Dataset
      • Table: comparison of RDD, DataFrame, and Dataset, including type-safety (cell contents not preserved in the transcript)
      • Figures: WordCount running time, RDD vs. Dataset; memory usage when caching, RDD vs. Dataset
      Source: Introducing Spark Datasets (Databricks blog)
  • 11. Spark SQL application (in Java): Linking
      • Add the Spark SQL dependency to pom.xml (the snippet on the slide is not preserved in the transcript)
  • 12. Spark SQL application (in Java): sample data
      • examples/src/main/resources/people.json
      • examples/src/main/resources/people.txt
  • 13. Spark SQL application (in Java): DataFrame example 1, JSON file

        import java.util.List;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaSparkContext;
        import org.apache.spark.api.java.function.Function;
        import org.apache.spark.sql.DataFrame;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SQLContext;

        // Class name is illustrative; the slide shows only the body of main.
        public class DataFrameJsonExample {
          public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            SQLContext sqlContext = new SQLContext(ctx);

            // The schema is inferred from the JSON records.
            DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");

            df.show();                                                   // all rows
            df.printSchema();                                            // inferred schema
            df.select("name").show();                                    // one column
            df.select(df.col("name"), df.col("age").plus(1)).show();     // age incremented by 1
            df.filter(df.col("age").gt(21)).show();                      // people older than 21
            df.groupBy("age").count().show();                            // number of people per age

            // Register the DataFrame as a table and query it with SQL.
            df.registerTempTable("people");
            DataFrame results = sqlContext.sql("SELECT name FROM people");
            List<String> names = results.javaRDD().map(new Function<Row, String>() {
              public String call(Row row) {
                return "Name: " + row.getString(0);
              }
            }).collect();
            for (String name : names) {
              System.out.println(name);
            }

            ctx.stop();
          }
        }
  • 14. Spark SQL application (in Java): DataFrame example 2, text file, specifying the schema

        import java.util.ArrayList;
        import java.util.List;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import org.apache.spark.api.java.function.Function;
        import org.apache.spark.sql.DataFrame;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.RowFactory;
        import org.apache.spark.sql.SQLContext;
        import org.apache.spark.sql.types.DataTypes;
        import org.apache.spark.sql.types.StructField;
        import org.apache.spark.sql.types.StructType;

        // Class name is illustrative; the slide shows only the body of main.
        public class DataFrameSchemaExample {
          public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            SQLContext sqlContext = new SQLContext(ctx);

            JavaRDD<String> people = ctx.textFile("examples/src/main/resources/people.txt");

            // Build the schema from a string of column names; every column is a nullable string.
            String schemaString = "name age";
            List<StructField> fields = new ArrayList<StructField>();
            for (String fieldName : schemaString.split(" ")) {
              fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
            }
            StructType schema = DataTypes.createStructType(fields);

            // Convert each "name, age" line into a Row that matches the schema.
            JavaRDD<Row> rowRDD = people.map(new Function<String, Row>() {
              public Row call(String record) throws Exception {
                String[] parts = record.split(",");
                return RowFactory.create(parts[0], parts[1].trim());
              }
            });

            DataFrame peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);
            peopleDataFrame.show();        // all rows
            peopleDataFrame.printSchema(); // the schema specified above

            ctx.stop();
          }
        }
  • 15. Spark SQL application (in Java): DataFrame example 3, text file, inferring the schema (JavaBean)

        import java.io.Serializable;

        public class Person implements Serializable {
          private String name;
          private int age;

          public String getName() { return name; }
          public void setName(String name) { this.name = name; }
          public int getAge() { return age; }
          public void setAge(int age) { this.age = age; }
        }
  • 16. Spark SQL application (in Java): DataFrame example 3, text file, inferring the schema (JavaBean)

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import org.apache.spark.api.java.function.Function;
        import org.apache.spark.sql.DataFrame;
        import org.apache.spark.sql.SQLContext;

        // Class name is illustrative; the slide shows only the body of main.
        public class DataFrameBeanExample {
          public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("dataFrame");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            SQLContext sqlContext = new SQLContext(ctx);

            // Parse each "name, age" line into a Person bean.
            JavaRDD<Person> people = ctx.textFile("examples/src/main/resources/people.txt").map(
              new Function<String, Person>() {
                public Person call(String line) throws Exception {
                  String[] parts = line.split(",");
                  Person person = new Person();
                  person.setName(parts[0]);
                  person.setAge(Integer.parseInt(parts[1].trim()));
                  return person;
                }
              });

            // The schema is inferred from the bean's getters.
            DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
            schemaPeople.registerTempTable("people");

            DataFrame teenagers = sqlContext.sql(
              "SELECT name, age FROM people WHERE age >= 13 AND age <= 19");
            teenagers.show();        // matching rows
            teenagers.printSchema(); // schema inferred from Person

            ctx.stop();
          }
        }
  • 17. Spark SQL application (in Java): Dataset example

        import java.util.List;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import org.apache.spark.api.java.function.Function;
        import org.apache.spark.api.java.function.MapFunction;
        import org.apache.spark.sql.DataFrame;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Encoders;
        import org.apache.spark.sql.SQLContext;

        // Class name is illustrative; the slide shows only the body of main and BuildString.
        public class DatasetExample {
          public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("dataset");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            SQLContext sqlContext = new SQLContext(ctx);
            // ... (elided on the slide; `people` is a JavaRDD<Person>)
            DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);

            // Convert the untyped DataFrame into a typed Dataset with a bean encoder.
            Dataset<Person> schools = schemaPeople.as(Encoders.bean(Person.class));

            // map needs an encoder for its result type.
            Dataset<String> strings = schools.map(new BuildString(), Encoders.STRING());
            // Java 8 lambda form:
            // Dataset<String> strings = schools.map(
            //     p -> p.getName() + " is " + p.getAge() + " years old.", Encoders.STRING());

            List<String> result = strings.collectAsList();
            for (String line : result) {
              System.out.println(line);
            }

            ctx.stop();
          }

          static class BuildString implements MapFunction<Person, String> {
            public String call(Person p) throws Exception {
              return p.getName() + " is " + p.getAge() + " years old.";
            }
          }
        }
  • 18. Spark SQL application (in Java): Dataset example, Encoder
      • http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoder.html
      • http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoders.html

        import java.util.Arrays;
        import java.util.List;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Encoder;
        import org.apache.spark.sql.Encoders;
        import scala.Tuple2;

        // `context` is the SQLContext.

        // 1. Primitive types
        List<String> data = Arrays.asList("abc", "abc", "xyz");
        Dataset<String> ds = context.createDataset(data, Encoders.STRING());

        // 2. Tuple types (K, V pairs)
        Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
        List<Tuple2<Integer, String>> data2 = Arrays.asList(new Tuple2<Integer, String>(1, "a"));
        Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);

        // 3. Reference types such as Java beans
        Encoders.bean(MyClass.class);
  • 19. References
      • Dataset
          • https://issues.apache.org/jira/browse/SPARK-9999
          • http://spark.apache.org/docs/latest/sql-programming-guide.html
          • http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html
          • http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoder.html
          • https://databricks.com/blog/2015/11/20/announcing-spark-1-6-preview-in-databricks.html
          • https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
          • http://technicaltidbit.blogspot.kr/2015/10/spark-16-datasets-best-of-rdds-and.html
          • http://www.slideshare.net/databricks/apache-spark-16-presented-by-databricks-cofounder-patrick-wendell
      • Tungsten
          • https://issues.apache.org/jira/browse/SPARK-7075
          • https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
      • Catalyst
          • https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
          • Michael Armbrust et al. Spark SQL: Relational Data Processing in Spark. In SIGMOD, 2015.