Spark SQL Training
- 5. Usage
• 1. Spark-shell submission
(1) Submit command: bin/spark-shell --master yarn --num-executors 10 --driver-memory 2g --executor-memory 2g --executor-cores 1
(2) The entry point is SQLContext or one of its subclasses, such as HiveContext
scala> import org.apache.spark.sql.hive.HiveContext
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
(3) Querying through the sqlContext
scala> sqlContext.sql("select * from **** where dt='20140909' limit 10").collect().foreach(println)
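The result of sql() can also be kept around and reused instead of being collected right away. A minimal sketch, assuming Spark 1.1+ where sql() returns a SchemaRDD; my_table and t_tmp are placeholder names, not from the slides:
scala> val rows = sqlContext.sql("select * from my_table where dt='20140909'")
scala> rows.cache()                      // keep the result in memory for reuse
scala> rows.registerTempTable("t_tmp")   // expose it to further SQL
scala> sqlContext.sql("select count(*) from t_tmp").collect().foreach(println)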
- 7. Usage
• 3. The spark-sql command
(1) Submit command: bin/spark-sql --master yarn --num-executors 200 --driver-memory 2g --executor-memory 2g --executor-cores 1
spark-sql> select * from where dt='20140909' limit 10;
(2) Execute HQL statements from a file via -f (an illustrative hql.txt appears after this list):
spark-sql --master yarn --num-executors 10 --driver-memory 2g --executor-memory 2g --executor-cores 1 -f hql.txt > result.data 2> hql.log
(3) Execute a given HQL statement via -e:
spark-sql --master yarn --num-executors 10 --driver-memory 2g --executor-memory 2g --executor-cores 1 -e "select * from limit 10" > result.data 2> hql.log
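For the -f form, hql.txt simply holds the statements to run, each ending with a semicolon; for example (test_table is a placeholder name, not from the slides):
select dt, count(*) from test_table where dt='20140909' group by dt;
The redirections > result.data 2> hql.log send the query output to result.data and the log messages to hql.log, keeping the result file clean.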
- 15. Usage Example
• UDFs are used the same way as in Hive: start spark-sql, then run add jar and create temporary function and the function is ready to use (a skeleton of such a UDF class is sketched after the example below)
spark-sql> add jar hive_udf.jar;
spark-sql> create temporary function url_to_mid as '';
spark-sql> create temporary function mid as '**';
spark-sql> select url_to_mid('z62QS3Ghr','1','0') from dual;
spark-sql> select mid('3520617028999724') from dual;
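The functions registered above follow the ordinary Hive UDF contract: a class extending org.apache.hadoop.hive.ql.exec.UDF with an evaluate() method, packaged into hive_udf.jar. The skeleton below is only an illustration; the class name, parameters, and logic of the real url_to_mid are not shown in the slides:
import org.apache.hadoop.hive.ql.exec.UDF

class UrlToMid extends UDF {
  // Hive finds evaluate() by reflection; add one method per supported signature
  def evaluate(code: String, flag1: String, flag2: String): String = {
    if (code == null) null else "decoded:" + code  // placeholder for the real URL-to-mid decoding
  }
}
The fully qualified class name is what goes after as '...' in create temporary function.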
- 21. Known Issues
• The GROUP BY issue
GROUP BY performs map-side aggregation using an in-memory hash structure. If the data is heavily skewed and a single key has too many rows, this can cause an OOM. Hive can disable map-side aggregation by setting hive.map.aggr to false; Spark SQL has no parameter to control this (see the illustration below).
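In Hive the workaround mentioned above looks like this (skewed_table is a placeholder name):
hive> set hive.map.aggr=false;
hive> select key, count(*) from skewed_table group by key;
With hive.map.aggr off, all aggregation happens on the reduce side, so mappers no longer build the hash table that overflows on a skewed key; the Spark SQL version covered here offers no equivalent switch.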