数据, 术→技巧

使用SQL统计箱形图数据

钱魏Way · · 1,349 次浏览
!文章内容如有错误或排版问题,请提交反馈,非常感谢!

箱形图常用于数据的可视化,先前的文章中介绍过使用Python生成箱形图。箱形图中通常包含的数据有:

  • 最小值(不包含异常值)Lower Whisker = Q1 – 1.5(Q3 – Q1)
  • 最大值(不包含异常值)Upper Whisker = Q3 + 1.5(Q3 – Q1)
  • 平均值(可通过设置显示)
  • 中位数
  • 下四分位 Q1
  • 上四分位 Q3

除了可视化外,箱型图有时也会用于异常检测等场景。今天主要介绍的是如何通过SQL(HiveSQL)来获取箱形图的值。

方式一:使用分位函数获取近似值

hive中有两个函数percentile和percentile_approx,可以用来计算分位数。

  • percentile:percentile(col, p) col是要计算的列(值必须为int类型),p的取值为0-1,若为5即中位数。
  • percentile_approx:percentile_approx(col, p)。列为数值类型可以使int也可以使float。percentile_approx还有一种形式percentile_approx(col, p, B),参数B控制内存消耗的近似精度,B越大,结果的精度越高。默认值为10000。当col字段中的distinct值的个数小于B时,结果就为准确的百分位数。

代码实现:

SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
, q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
SELECT product, count(1) AS quantity, avg(price) AS price_mean
, max(price) AS price_max, min(price) AS price_min
, percentile_approx(price, 0.5) AS price_median
, percentile_approx(price, 0.25) AS q1
, percentile_approx(price, 0.75) AS q3
FROM testdb.product_price
GROUP BY product
) t

方式二:使用窗口函数获取精确值

WITH details AS (
SELECT product, price, ROW_NUMBER() OVER (PARTITION BY od ORDER BY price) AS rn
, SUM(1) OVER (PARTITION BY product) AS total
FROM testdb.product_price
),
quartiles AS (
SELECT product, price
, AVG(CASE
WHEN rn >= FLOOR(total / 2.0) / 2.0
AND rn<= FLOOR(total / 2.0) / 2.0 + 1
THEN price / 1.0
ELSE NULL
END) OVER (PARTITION BY product) AS q1
, AVG(CASE
WHEN rn >= total / 2.0
AND rn<= total / 2.0 + 1
THEN price / 1.0
ELSE NULL
END) OVER (PARTITION BY product) AS median
, AVG(CASE
WHEN rn >= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0
AND rn<= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0 + 1
THEN price / 1.0
ELSE NULL
END) OVER (PARTITION BY product) AS q3
FROM details
)
SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
, q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
SELECT product, count(1) AS order_count, avg(price) AS price_mean
, max(price) AS price_max, min(price) AS price_min
, AVG(median) AS price_median, AVG(q1) AS q1
, AVG(q3) AS q3
FROM quartiles
GROUP BY product
) t

参考链接:

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注