Impala

https://impala.apache.org/docs/build/

https://impala.apache.org/docs/build/topics/

https://impala.apache.org/docs/build/html/topics/

Impala 内置函数: https://impala.apache.org/docs/build/html/topics/impala_functions.html

数学函数

https://impala.apache.org/docs/build/html/topics/impala_math_functions.html#math_functions

类型转换函数

https://impala.apache.org/docs/build/html/topics/impala_conversion_functions.html#conversion_functions

cast(year(“2017-06-15”) as decimal(8,0)) as date_year,

cast(‘100’ as int), – 100

typeof(‘100’), – STRING 查看字段类型

数据类型

https://impala.apache.org/docs/build/html/topics/impala_datatypes.html

日期和时间函数

https://impala.apache.org/docs/build/html/topics/impala_datetime_functions.html#datetime_functions

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
select trunc('2022-04-09', 'y'); -- 当年第一天
select trunc('2022-04-09', 'yy'); -- 当年第一天
select trunc('2022-04-09', 'yyyy'); -- 当年第一天
select trunc('2022-04-09', 'mm'); -- 当月第一天
select trunc('2022-04-09', 'dd'); -- 当天第一天,即当天
select trunc('2022-04-09', 'd'); -- 当前星期第一天(具体是周一还是周日要看数据库)
select trunc('2022-04-09', 'hh'); -- 当年第一天
select trunc('2022-04-09', ' mi'); -- 当年第一天
select dayofweek( CAST('2022-04-09' AS TIMESTAMP )); -- 返回日期论证的日期字段,对应于一周中的一天。返回值的范围为1(周日)到7(周六)。

SELECT
trunc ( '2022-04-09', 'd' ) AS week_first_day, -- 所在周第一天
trunc ( '2022-04-09', 'd' ) + INTERVAL 6 DAY AS week_last_day; -- 所在周最后一天

SELECT
date_sub(
cast( '2022-04-09' AS TIMESTAMP ),
INTERVAL
IF
( dayofweek( CAST( '2022-04-09' AS TIMESTAMP ))- 1 = 0, 6, dayofweek( CAST( '2022-04-09' AS TIMESTAMP ))- 2 ) DAY
) AS week_first_day, -- 所在周第一天
date_sub(
cast( '2022-04-09' AS TIMESTAMP ),
INTERVAL
IF
( dayofweek( CAST( '2022-04-09' AS TIMESTAMP ))- 1 = 0, 0, dayofweek( CAST( '2022-04-09' AS TIMESTAMP ))- 8 ) DAY
) AS week_last_day; -- 所在周最后一天

查询某一天是当年的第几周,年第一天非周一的日期归属到去年最后一周

输入 data_dt 输出 dt_week
20210103 2020W53
20201231 2020W53
20190104 2019W1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
SELECT
CASE WHEN
substr( data_dt, 5, 2 ) = '01' AND cast( weekofyear( cast( concat( substr( data_dt, 1, 4 ), '-', substr( data_dt, 5, 2 ), '-', substr( data_dt, 7, 2 )) AS TIMESTAMP )) AS string ) = '53'
THEN
concat( substr( cast( add_months ( cast( concat( substr( data_dt, 1, 4 ), '-', substr( data_dt, 5, 2 ), '-', substr( data_dt, 7, 2 )) AS TIMESTAMP ), -1 ) AS string ), 1, 4 ),
'W',
cast( weekofyear( cast( concat( substr( data_dt, 1, 4 ), '-', substr( data_dt, 5, 2 ), '-', substr( data_dt, 7, 2 )) AS TIMESTAMP )) AS string )
)
ELSE
concat( substr( data_dt, 1, 4 ),
'W',
cast( weekofyear( cast( concat ( substr( data_dt, 1, 4 ), '-', substr( data_dt, 5, 2 ), '-', substr( data_dt, 7, 2 )) AS TIMESTAMP )) AS string )
)
END AS dt_week
FROM
( SELECT '20210103' AS data_dt UNION ALL SELECT '20201231' UNION ALL SELECT '20190104' ) tt

条件函数

https://impala.apache.org/docs/build/html/topics/impala_conditional_functions.html#conditional_functions

字符串函数

https://impala.apache.org/docs/build/html/topics/impala_string_functions.html#string_functions

聚合函数

https://impala.apache.org/docs/build/html/topics/impala_aggregate_functions.html#aggregate_functions

分析函数

https://impala.apache.org/docs/build/html/topics/impala_analytic_functions.html#analytic_functions

AVG

COUNT

CUME_DIST:返回值的累积分布。结果集中每一行的值都大于 0 且小于或等于 1。

DENSE_RANK:递增不跳跃并列的序列号1,1,2

FIRST_VALUE:分组的第一个数据

LAG:分组的后几个数据。Lagging 落后于

LAST_VALUE:分组的最后一个数据

LEAD:分组的前几个数据。Leading 领先于

MAX

MIN

NTH_VALUE:返回相对于窗口的第一行的窗口框架的指定行的表达式值。

NTILE:将分区中已排序的行划分为大小尽可能相等的指定数量的已排名组,并返回给定行所在的组。

PERCENT_RANK:百分比排名函数。计算公式:(rank - 1) / (rows - 1)

RANK:递增跳跃的序列号1,1,3

ROW_NUMBER:依次递增的序列号1,2,3

SUM

简单分类

序号函数:row_number() / rank() / dense_rank()

分布函数:percent_rank() / cume_dist()

前后函数:lag() / lead()

头尾函数:first_val() / last_val()

其他函数:nth_value() / ntile()

OVER 子句

analytic functions:分析函数

aggregate functions:聚合函数

调用纯分析函数(例如:LEAD()RANK()FIRST_VALUE())必须使用 OVER 子句。

当调用的聚合函数包含 OVER 子句时(例如:MAX()COUNT()SUM()),它们将作为分析函数运行。

语法:

1
2
3
4
function(args) OVER([partition_by_clause] [order_by_clause [window_clause]])

partition_by_clause ::= PARTITION BY expr [, expr ...]
order_by_clause ::= ORDER BY expr [ASC | DESC] [NULLS FIRST | NULLS LAST] [, expr [ASC | DESC] [NULLS FIRST | NULLS LAST] ...]

限制

不能直接将 DISTINCT 运算符与分析函数调用结合起来。可以将分析函数调用放在 WITH 子句或内联视图中,并将 DISTINCT运算符应用于其结果集。

1
2
3
4
5
6
7
WITH tt (
SELECT
x,
sum(x) OVER (PARTITION BY x) AS total
FROM t1
)
SELECT DISTINCT x, total FROM tt;

window 子句

窗口子句仅允许与 ORDER BY 子句组合使用。

如果指定了ORDER BY子句,但没有指定 window 子句,默认窗口是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

某些分析函数接受可选的 window 子句,该子句使函数仅分析当前行“周围”的某些行,而不是分区中的所有行。例如,您可以通过指定一定数量的前行和后行来获得移动平均数,或者通过指定截至当前位置的所有行来获得运行计数或运行总计。该子句可能导致同一分区内的行的分析结果不同。

支持窗口子句的函数:AVG()COUNT()FIRST_VALUE()LAST_VALUE()SUM()

对于 MAX()MIN(),仅当开始边界为 UNBOUNDED PRECEDING 时才允许使用 window 子句。

语法:

1
2
ROWS BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ]
RANGE BETWEEN [ {m | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | n } FOLLOWING] ]

ROWS BETWEEN 根据结果集中行的索引定义窗口的大小。窗口的大小是根据结果集中的位置子句预测的。

RANGE BETWEEN 当前不支持使用数字参数来定义可变大小的滑动窗口。

目前,Impala 仅支持 RANGE 子句的一些参数组合:

  • RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(当指定了 ORDER BY 并且省略了 window子句 时的默认值)
  • RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  • RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

使用时 RANGECURRENT ROW不仅包括当前行,还包括基于 ORDER BY 表达式与当前行关联的所有行。

窗口函数参考

https://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_SQL_functions.html

位函数

https://impala.apache.org/docs/build/html/topics/impala_bit_functions.html#bit_functions

杂项功能

https://impala.apache.org/docs/build/html/topics/impala_misc_functions.html#misc_functions

Impala 支持以下不对特定列或数据类型进行操作的实用程序函数:

1
2
3
4
5
6
7
8
9
10
CURRENT_DATABASE
EFFECTIVE_USER
GET_JSON_OBJECT
LOGGED_IN_USER
PID
SLEEP
USER
UUID
VERSION
COORDINATOR

关联查询语句

https://impala.apache.org/docs/build/html/topics/impala_joins.html

full outer join 不超过万速度还可以

cross join 不超过万速度还可以,又称为笛卡尔乘积,实际上是把两个表乘起来,理论上不应该支持on语法

natural join 自然连接不支持

连接查询的性能注意事项

https://impala.apache.org/docs/build/html/topics/impala_perf_joins.html

EXPLAIN语句

https://impala.apache.org/docs/build/html/topics/impala_explain.html

ApacheImpala优化建议子集

1、尽量使用数字类型

2、选择字段的字段只选择需要的字段

3、left join时左边放大表表右边放小表,right join时右大左小

4、Join 字段最好使用整数类型。

行转列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
SELECT
operator,
year_1,
year_2,
year_3
FROM
(
SELECT
operator AS operator,
YEAR "year_1",
lead( YEAR, 1 ) over ( PARTITION BY operator ORDER BY YEAR ) "year_2",
lead( YEAR, 2 ) over ( PARTITION BY operator ORDER BY YEAR ) "year_3",
row_number() over ( PARTITION BY operator ORDER BY YEAR ) rn
FROM
( SELECT '65222' AS operator, '1' AS YEAR UNION ALL SELECT '65222', '2' UNION ALL SELECT '65222', '3' ) tt
) a
WHERE
rn = 1

多行聚合一行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
SELECT
tt.site,
group_concat( band SEPARATOR '+' ) AS band_plus
FROM
(
SELECT
'700M' AS band,
'b' AS site UNION ALL
SELECT
'700M',
'a' UNION ALL
SELECT
'2100M',
'a' UNION ALL
SELECT
'2600M',
'a' UNION ALL
SELECT
'850M',
'b'
) tt
GROUP BY
tt.site

一行切割为多行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
-- 数据表条数必须大于等于字段分割后的最大项数
SELECT
r1.site,
r1.union_frequencty,
split_part ( r1.union_frequencty, '+', item_num ) union_frequencty_split
FROM
( SELECT 'a' AS site, '700M+2100M+2600M' AS union_frequencty UNION ALL SELECT 'b', '700M+850M' UNION ALL SELECT 'c', '700M+900M' UNION ALL SELECT 'd', '700M+1800M' ) r1
CROSS JOIN (
SELECT
item_num
FROM
(
SELECT
row_number() over ( ORDER BY union_frequencty ) item_num
FROM
( SELECT 'a' AS site, '700M+2100M+2600M' AS union_frequencty UNION ALL SELECT 'b', '700M+850M' UNION ALL SELECT 'c', '700M+900M' UNION ALL SELECT 'd', '700M+1800M' ) tt
) aa
WHERE
item_num <= 10 -- 分割后的最大项数限制为10,这里不一定要这么些,只要查出1-10的序列即可
) r2
WHERE
-- CROSS JOIN条件,保证笛卡尔积后的条数等于该条数据的项数
item_num <= (length( r1.union_frequencty ) - length(REPLACE ( r1.union_frequencty, '+', '' )))+ 1;

不同排序维度数据提取

1
2
3
4
5
6
7
SELECT
tt.*,
rank() over ( PARTITION BY part ORDER BY download ) rank_num,-- 递增跳跃的序列号 1,2,2,4
dense_rank() over ( PARTITION BY part ORDER BY download ) dense_rank_num,-- 递增不跳跃并列的序列号 1,2,2,3
row_number() over ( PARTITION BY part ORDER BY download ) row_num -- 依次递增的序列号 1,2,3,4
FROM
( SELECT 'A' AS part, '3' AS download UNION ALL SELECT 'A', '5' UNION ALL SELECT 'A', '2' UNION ALL SELECT 'A', '3' ) tt

经纬度计算距离

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT
round(
6371004 *
2 *
asin(
sqrt(
power( sin( (radians(g3latitude)-radians(g4latitude)) / 2 ), 2 ) +
cos( radians(g3latitude)) * cos(radians(g4latitude)) * power( sin( (radians(g3longitude)-radians(g4longitude)) / 2 ), 2 )
)
)
) as distances
FROM
( SELECT '-31.390778' AS g3latitude, '-64.253167' AS g3longitude, '-31.164440' AS g4latitude, '-64.315277' AS g4longitude ) tt

线性回归/线性拟合

处理前数据 month xx yy 预期输出 month xx yy
2019-09 2723822.6 1.44 2019-09 2723822.6 1.44
2019-10 2740666.4 1.47 2019-10 2740666.4 1.47
2019-11 2847629.4 1.49 2019-11 2847629.4 1.49
2020-01 2830692.2 1.51 2020-01 2830692.2 1.51
2020-02 2751622.5 1.53 2020-02 2751622.5 1.53
2020-03 2828788.1 1.54 2020-03 2828788.1 1.54
2020-04 3009157.4 1.58 2020-04 3009157.4 1.58
2020-05 2870360.3 1.62 2020-05 2870360.3 1.62
2020-06 2958902.3 1.66 2020-06 2958902.3 1.66
2020-07 3023999.5 2020-07 3023999.5 1.64
2020-08 3136022.3 2020-08 3136022.3 1.70
2020-09 3265788.6 2020-09 3265788.6 1.78
2020-10 3348522.8 2020-10 3348522.8 1.82
2020-11 3439811.7 2020-11 3439811.7 1.87
2020-12 3532455.9 2020-12 3532455.9 1.93
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
DROP TABLE IF EXISTS base_table;
CREATE TABLE base_table as
select '2019-09' as month, 2723822.6 as xx, 1.44 as yy from dual union all
select '2019-10', 2740666.4, 1.47 from dual union all
select '2019-11', 2847629.4, 1.49 from dual union all
select '2020-01', 2830692.2, 1.51 from dual union all
select '2020-02', 2751622.5, 1.53 from dual union all
select '2020-03', 2828788.1, 1.54 from dual union all
select '2020-04', 3009157.4, 1.58 from dual union all
select '2020-05', 2870360.3, 1.62 from dual union all
select '2020-06', 2958902.3, 1.66 from dual union all
select '2020-07', 3023999.5, null from dual union all
select '2020-08', 3136022.3, null from dual union all
select '2020-09', 3265788.6, null from dual union all
select '2020-10', 3348522.8, null from dual union all
select '2020-11', 3439811.7, null from dual union all
select '2020-12', 3532455.9, null from dual;

DROP TABLE IF EXISTS middle_table_1;
CREATE TABLE middle_table_1 AS
SELECT
tt.*,
CASE
WHEN yy IS NOT NULL THEN 1
ELSE 6
END AS mark,
row_number() over ( ORDER BY MONTH ) AS row_num
FROM
base_table tt;

DROP TABLE IF EXISTS middle_table_2;
CREATE TABLE middle_table_2 AS
SELECT
avg( xx ) AS x_avg,
avg( yy ) AS y_avg
FROM
middle_table_1 tt
WHERE
mark = 1;

DROP TABLE IF EXISTS middle_table_3;
CREATE TABLE middle_table_3 AS
SELECT
aa.*,
bb.*,
( xx - x_avg ) * ( yy - y_avg ) AS b_numerator,
power( ( xx - x_avg ), 2 ) AS b_denominator
FROM
middle_table_1 aa
CROSS JOIN middle_table_2 bb;

DROP TABLE IF EXISTS middle_table_4;
CREATE TABLE middle_table_4 AS
-- (2) 、参数表1关联系数b,算出系数a
SELECT
aa.*,
bb.b,
aa.y_avg - bb.b * aa.x_avg AS a
FROM
middle_table_2 aa
CROSS JOIN
-- (1) 、通过标记行算出b_numerator和b_denominator各自的和,相除之后就是
( SELECT sum( b_numerator )/ sum( b_denominator ) AS b FROM middle_table_3 WHERE mark = 1 ) bb;

-- 最终
SELECT
month,
xx,
CASE
WHEN mark = 1 THEN yy
ELSE b * xx + a
END AS y_new -- 算出来的值
FROM
-- 将主表2和参数表2进行全关联 ,获取a,b
( SELECT aa.*, a, b FROM middle_table_3 aa CROSS JOIN middle_table_4 bb ) cc;

待熟练函数

	获取数值的对应的百分位数值percentile