数据, 术→技巧

使用SQL统计箱形图数据

钱魏Way · · 75 次浏览

箱形图常用于数据的可视化,先前的文章中介绍过使用Python生成箱形图。箱形图中通常包含的数据有:

  • 最小值(不包含异常值)Lower Whisker = Q1-1.5(Q3-Q1)
  • 最大值(不包含异常值)Upper Whisker = Q3+1.5(Q3-Q1)
  • 平均值(可通过设置显示)
  • 中位数
  • 下四分位Q1
  • 上四分位Q3

除了可视化外,箱型图有时也会用于异常检测等场景。今天主要介绍的是如何通过SQL(Hive SQL)来获取箱形图的值。

方式一:使用分位函数获取近似值

hive中有两个函数percentile和percentile_approx,可以用来计算分位数。

  • percentile:percentile(col, p) col是要计算的列(值必须为int类型),p的取值为0-1,若为5即中位数。
  • percentile_approx:percentile_approx(col, p)。列为数值类型可以使int也可以使float。percentile_approx还有一种形式percentile_approx(col, p,B),参数B控制内存消耗的近似精度,B越大,结果的精度越高。默认值为10000。当col字段中的distinct值的个数小于B时,结果就为准确的百分位数。

代码实现:

SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
    , q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
    SELECT product, count(1) AS quantity, avg(price) AS price_mean
        , max(price) AS price_max, min(price) AS price_min
        , percentile_approx(price, 0.5) AS price_median
        , percentile_approx(price, 0.25) AS q1
        , percentile_approx(price, 0.75) AS q3
    FROM testdb.product_price
    GROUP BY product
) t

方式二:使用窗口函数获取精确值

WITH details AS (
        SELECT product, price, ROW_NUMBER() OVER (PARTITION BY od ORDER BY price) AS rn
            , SUM(1) OVER (PARTITION BY product ) AS total
        FROM testdb.product_price
    ), 
    quartiles AS (
        SELECT product, price
            , AVG(CASE 
                WHEN rn >= FLOOR(total / 2.0) / 2.0
                    AND rn <= FLOOR(total / 2.0) / 2.0 + 1
                THEN price / 1.0
                ELSE NULL
            END) OVER (PARTITION BY product ) AS q1
            , AVG(CASE 
                WHEN rn >= total / 2.0
                    AND rn <= total / 2.0 + 1
                THEN price / 1.0
                ELSE NULL
            END) OVER (PARTITION BY product ) AS median
            , AVG(CASE 
                WHEN rn >= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0
                    AND rn <= CEIL(total / 2.0) + FLOOR(total / 2.0) / 2.0 + 1
                THEN price / 1.0
                ELSE NULL
            END) OVER (PARTITION BY product ) AS q3
        FROM details
    )
SELECT *, q1 - 1.5 * (q3 - q1) AS price_lower
    , q3 + 1.5 * (q3 - q1) AS price_upper
FROM (
    SELECT product, count(1) AS order_count, avg(price) AS price_mean
        , max(price) AS price_max, min(price) AS price_min
        , AVG(median) AS price_median, AVG(q1) AS q1
        , AVG(q3) AS q3
    FROM quartiles
    GROUP BY product
) t

参考链接:

发表评论

您的电子邮箱地址不会被公开。 必填项已用*标注