!文章内容如有错误或排版问题,请提交反馈,非常感谢!
箱形图常用于数据的可视化,先前的文章中介绍过使用Python生成箱形图。箱形图中通常包含的数据有:
- 最小值(不包含异常值)Lower Whisker = Q1 – 1.5(Q3 – Q1)
- 最大值(不包含异常值)Upper Whisker = Q3 + 1.5(Q3 – Q1)
- 平均值(可通过设置显示)
- 中位数
- 下四分位 Q1
- 上四分位 Q3
除了可视化外,箱型图有时也会用于异常检测等场景。今天主要介绍的是如何通过SQL(HiveSQL)来获取箱形图的值。
方式一:使用分位函数获取近似值
hive中有两个函数percentile和percentile_approx,可以用来计算分位数。
- percentile:percentile(col, p) col是要计算的列(值必须为int类型),p的取值为0-1,若为5即中位数。
- percentile_approx:percentile_approx(col, p)。列为数值类型可以使int也可以使float。percentile_approx还有一种形式percentile_approx(col, p, B),参数B控制内存消耗的近似精度,B越大,结果的精度越高。默认值为10000。当col字段中的distinct值的个数小于B时,结果就为准确的百分位数。
代码实现:
SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower , q3 + 1.5 * (q3 - q1) AS price_upper FROM ( SELECT product, count(1) AS quantity, avg(price) AS price_mean , max(price) AS price_max, min(price) AS price_min , percentile_approx(price, 0.5) AS price_median , percentile_approx(price, 0.25) AS q1 , percentile_approx(price, 0.75) AS q3 FROM testdb.product_price GROUP BY product ) t
方式二:使用窗口函数获取精确值
WITH details AS ( SELECT product, price, ROW_NUMBER() OVER (PARTITION BY od ORDER BY price) AS rn , SUM(1) OVER (PARTITION BY product) AS total FROM testdb.product_price ), quartiles AS ( SELECT product, price , AVG(CASE WHEN rn >= FLOOR(total / 2.0) / 2.0 AND rn<= FLOOR(total / 2.0) / 2.0 + 1 THEN price / 1.0 ELSE NULL END) OVER (PARTITION BY product) AS q1 , AVG(CASE WHEN rn >= total / 2.0 AND rn<= total / 2.0 + 1 THEN price / 1.0 ELSE NULL END) OVER (PARTITION BY product) AS median , AVG(CASE WHEN rn >= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0 AND rn<= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0 + 1 THEN price / 1.0 ELSE NULL END) OVER (PARTITION BY product) AS q3 FROM details ) SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower , q3 + 1.5 * (q3 - q1) AS price_upper FROM ( SELECT product, count(1) AS order_count, avg(price) AS price_mean , max(price) AS price_max, min(price) AS price_min , AVG(median) AS price_median, AVG(q1) AS q1 , AVG(q3) AS q3 FROM quartiles GROUP BY product ) t
参考链接: