!文章内容如有错误或排版问题,请提交反馈,非常感谢!
HiveSQL中的datediff函数返回的是2个日期的间隔天数。在使用过程中发现了一个比较有趣的坑:
SELECT customer_id, COUNT(DISTINCT date(createdate))-1 AS frequency , datediff(MAX(createdate), MIN(createdate)) AS recency , datediff(CURRENT_DATE, MIN(createdate)) AS T , CASE WHEN COUNT(DISTINCT createdate)-1=0 THEN 0 ELSE SUM(totaltakeoff)/COUNT(DISTINCT createdate) END AS monetary_value FROM orderdb.orderdetail
其中createdate为datetime类型,执行SQL发现存在recency>T的数据。执行如下代码,结果也存在问题:
SELECT customer_id, COUNT(DISTINCT date(createdate))-1 AS frequency , datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency , datediff(CURRENT_DATE, to_date(MIN(createdate))) AS T , CASE WHEN COUNT(DISTINCT createdate)-1=0 THEN 0 ELSE SUM(totaltakeoff)/COUNT(DISTINCT createdate) END AS monetary_value FROM orderdb.orderdetail
可能原因是CURRENT_DATE,其中CURRENT_DATE的值等于TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()),使用TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())再做测试,结果又是正常的。
SELECT customer_id, COUNT(DISTINCT date(createdate))-1 AS frequency , datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency , datediff(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), to_date(MIN(createdate))) AS T , CASE WHEN COUNT(DISTINCT createdate)-1=0 THEN 0 ELSE SUM(totaltakeoff)/COUNT(DISTINCT createdate) END AS monetary_value FROM orderdb.orderdetail
至此,大致可以将原因定位到CURRENT_DATE上,中间最主要的区别是:current_date返回的内容是date格式,而to_date返回的是字符串(2.1版本之前是string,2.1版本后面是返回date)。再做一次进一步测试如下:
SELECT datediff('2018-09-16 00:01:55','2018-09-16 00:00:52') AS t1 , datediff(current_date,'2018-09-16 00:00:52') AS t2 , datediff(current_date,'2018-09-17 00:00:52') AS t3 , datediff(current_date,'2018-09-16') AS t4 , datediff(current_date,'2018-09-17') AS t5 , datediff(TO_DATE(current_date),'2018-09-16 00:00:52') AS t6 , datediff(TO_DATE(current_date),'2018-09-17 00:00:52') AS t7 , datediff(TO_DATE(current_date),'2018-09-16') AS t8 , datediff(TO_DATE(current_date),'2018-09-17') AS t9
执行后的结果为:0 0 0 1 0 1 0 1,进一步寻找,发现已经人提交了bug:
[https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297824#comment-16297824 ] Hengyu Dai commented on HIVE-18304: ----------------------------------- SimpleDateFormat.parse(String source) method will convert String type (UTC) to java.util.Date type (use current JVM timezone), this may lead deviations in time when JVM timezone is not UTC, my environment is GMT+8, 8 hours is added comparing to the UTC time. while for a date type argument, the default JVM timezone is used. The patch uploaded treats String type and Date type the same way to remove the deviations. > datediff() UDF returns a wrong result when dealing with a (date, string) input >------------------------------------------------------------------------------ > >Key: HIVE-18304 >URL: https://issues.apache.org/jira/browse/HIVE-18304 >Project: Hive >Issue Type: Bug >Components: UDF >Reporter: Hengyu Dai >Assignee: Hengyu Dai >Priority: Minor >Attachments: 0001.patch > > >for date type argument, datediff() use DateConverter to convert input to a java Date object, >for example, a '2017-12-18' will get 2017-12-18T00:00:00.000+0800 >for string type argument, datediff() use TextConverter to convert a string to date, >for '2012-01-01' we will get 2012-01-01T08:00:00.000+0800 >now, datediff() will return a number less than the real date diff >we should use TextConverter to deal with date input too. >reproduce: >{code:java} >select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177 >select datediff('2017-12-18', '2012-01-01'); --2178 >{code}