8. Left outer join 产生表A的完全集,而B表中匹配的则有值,没
有匹配的则以NULL值填充。
sql基础_表连接
left outer join
执行sql:
select
A.user_id,
A.user_name,
B.user_id,
B.user_name
from elective_english_students A
left outer join
elective_math_students B
on A.user_id=B.user_id;
right outer join 产生表B的完全集,而A表中匹配的则有
值,没有匹配的则以NULL值填充。
9. Full outer join 产生A和B的并集。但是需要注意的是,对于没
有匹配的记录,则会以NULL值填充。
sql基础_表连接
full outer join
执行sql:
select
A.user_id,
A.user_name,
B.user_id,
B.user_name
from elective_english_students A
full outer join
elective_math_students B
on A.user_id=B.user_id;
mysql不支持full outer join
12. ...
10007
10008
10009
1
40
42
45
...
sql基础_排序
ORDER BY和SORT BY
ORDER BY全局排序,就是对指定的所有排序键进
行全局排序,使用ORDER BY的查询语句,最后会用
一个Reduce Task来完成全局排序。
SORT BY
SORT BY用于分区内排序,即每个 Reduce任务内排
序
set mapred.reduce.tasks=2;
select id from dim.city sort by id
[order | sort] by column [desc|asc]
desc 为降序,asc 为升序
sort by
18. 窗口分析函数
OVER(PARTITION BY COLUMN ROWS BETWEEN ..)
ROWS ((CURRENT ROW) | (UNBOUNDED | [num]) PRECEDING) AND
(UNBOUNDED | [num]) FOLLOWING
1. PRECEDING:往前
2. FOLLOWING:往后
3. CURRENT ROW:当前行
4. UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED
FOLLOWING:表示到后面的终点
如果不指定ROWS BETWEEN,默认为从起点到当前值
SUM(pay_amount) OVER (PARTITION BY userid ORDER BY pay_datekey ROWS
BETWEEN 2 PRECEDING AND 0 FOLLOWING)
分析函数
22. SELECT
user_id,
pay_datekey,
pay_amount,
ROW_NUMBER() OVER (PARTITION BY userid ORDER BY pay_datekey ASC)
row_number,
RANK() OVER (PARTITION BY userid ORDER BY pay_datekey ASC) rank,
DENSE_RANK() OVER (PARTITION BY userid ORDER BY pay_datekey ASC) dense_rank,
PERCENT_RANK() OVER (PARTITION BY userid ORDER BY pay_datekey ASC)
percent_rank,
NTILE(2) OVER (PARTITION BY userid ORDER BY pay_datekey ASC) ntile
FROM order_detail
分析函数
29. 条件函数
if语句
用法: if(conditional expr, true value, false value)
当表达式expr成立时,取true value,当表达式expr不成立时,取false value
case when 语句
case
when rank_biz_sequence>1 then '老客'
when rank_biz_sequence=1 and rank_all_biz_sequence>1 then '转新'
when rank_biz_sequence=1 and rank_all_biz_sequence=1 then '纯新'
else ‘未知’
end
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
使用说明: When a = b, returns c; when a = d, returns e; else returns f.
常用hive udf
30. 条件函数
nvl(T value, T default_value)
如果value为null,则返回默认值
COALESCE(T v1, T v2, ...)
返回第一个不为null的值,如果都为null 则返回null
boolean isnull( a )
使用说明:Returns true if a is NULL and false otherwise.
boolean isnotnull ( a )
使用说明:Returns true if a is not NULL and false otherwise.
greatest(T v1, T v2, ...)
使用说明:Returns the greatest value of the list of values
least(T v1, T v2, ...)
使用说明:Returns the least value of the list of values
常用hive udf
31. 日期函数
1、int datediff(string enddate, string startdate)
使用说明: Returns the number of days from startdate to enddate
使用示例:
datediff('2009-03-01', '2009-02-27') = 2
datediff('2009-02-27', '2009-03-01') = -2
2、string date_sub(string startdate, int days)
返回从startdate减去days天的日期: date_sub('2008-12-31', 1) = '2008-12-30'
3、string date_add(string startdate, int days)
返回从startdate减去days天的日期: date_sub('2008-12-31', 1) = '2008-12-30'
4、bigint unix_timestamp()
返回当前时间的时间戳,精度为秒
5、string from_unixtime(bigint unixtime, string format)
将时间戳(seconds)转化为格式化的日期;
from_unixtime('1237573801','yyyy-MM-dd HH:mm:ss') = ‘2009-03-20 11:30:01’
from_unixtime('1237573801','yyyy-MM-dd') = ’2009-03-20‘
SELECT from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss') from dim.city limit 1
输出:2015-09-17 11:02:24
常用hive udf
34. 1、少用COUNT DISTINCT,数据量小的时候无所谓,数据量大的情况下,由于 COUNT
DISTINCT操作需要用一个Reduce Task来完成,这一个Reduce需要处理的数据量太大,就会
导致整个Job很难完成,一般COUNT DISTINCT使用先GROUP BY再COUNT的方式替换。
select count(distinct uuid) from log.mobilelog where dt=’20150501’
====>
select sum(1) from (select uuid from log.mobilelog where dt=’20150501’ group by uuid)t
SQL优化