Hive SQL中的datediff函数返回的是2个日期的天数。在使用过程中发现了一个比较有趣的坑:
1 2 3 4 5 6 7 8 |
SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency , datediff(MAX(createdate), MIN(createdate)) AS recency , datediff(CURRENT_DATE, MIN(createdate)) AS T , CASE WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0 ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate) END AS monetary_value FROM orderdb.orderdetail |
其中createdate为datetime类型,执行SQL发现存在recency>T的数据。出现上述问题,初步判断是由于createtime导致的,查询文档发现并不是这么一回事:
1 2 3 4 5 6 |
hive> desc function extended datediff; OK datediff(date1, date2) - Returns the number of days between date1 and date2 date1 and date2 are strings in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored.If date1 is earlier than date2, the result is negative. Example: > SELECT datediff('2018-07-30', '2018-07-31') FROM src LIMIT 1; |
执行如下代码,结果也存在问题:
1 2 3 4 5 6 7 8 |
SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency , datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency , datediff(CURRENT_DATE, to_date(MIN(createdate))) AS T , CASE WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0 ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate) END AS monetary_value FROM orderdb.orderdetail |
那么最有可能出现问题的是CURRENT_DATE,其中CURRENT_DATE的值等于TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()),使用TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())再做测试,结果又是正常的。
1 2 3 4 5 6 7 8 |
SELECT customer_id, COUNT(DISTINCT date(createdate)) - 1 AS frequency , datediff(to_date(MAX(createdate)), to_date(MIN(createdate))) AS recency , datediff(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), to_date(MIN(createdate))) AS T , CASE WHEN COUNT(DISTINCT createdate) - 1 = 0 THEN 0 ELSE SUM(totaltakeoff) / COUNT(DISTINCT createdate) END AS monetary_value FROM orderdb.orderdetail |
至此,大致可以将原因定位到CURRENT_DATE上,中间最主要的区别是:current_date返回的内容是date格式,而to_date返回的是字符串(2.1版本之前是string,2.1版本后面是返回date)。再做一次进一步测试如下:
1 2 3 4 5 6 7 8 9 |
SELECT datediff('2018-09-16 00:01:55', '2018-09-16 00:00:52') AS t1 , datediff(current_date, '2018-09-16 00:00:52') AS t2 , datediff(current_date, '2018-09-17 00:00:52') AS t3 , datediff(current_date, '2018-09-16') AS t4 , datediff(current_date, '2018-09-17') AS t5 , datediff(TO_DATE(current_date), '2018-09-16 00:00:52') AS t6 , datediff(TO_DATE(current_date), '2018-09-17 00:00:52') AS t7 , datediff(TO_DATE(current_date), '2018-09-16') AS t8 , datediff(TO_DATE(current_date), '2018-09-17') AS t9 |
执行后的结果为:0 0 0 0 1 0 1 0,进一步寻找,可获得如下信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
[ https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297824#comment-16297824 ] Hengyu Dai commented on HIVE-18304: ----------------------------------- SimpleDateFormat.parse(String source) method will convert String type(UTC) to java.util.Date type(use current JVM timezone), this may lead deviations in time when JVM timezone is not UTC, my environment is GMT+8, 8 hours is added comparing to the UTC time. while for a date type argument, the default JVM timezone is used. The patch uploaded treats String type and Date type at the same way to remove the deviations. > datediff() UDF returns a wrong result when dealing with a (date, string) input > ------------------------------------------------------------------------------ > > Key: HIVE-18304 > URL: https://issues.apache.org/jira/browse/HIVE-18304 > Project: Hive > Issue Type: Bug > Components: UDF > Reporter: Hengyu Dai > Assignee: Hengyu Dai > Priority: Minor > Attachments: 0001.patch > > > for date type argument, datediff() use DateConverter to convert input to a java Date object, > for example, a '2017-12-18' will get 2017-12-18T00:00:00.000+0800 > for string type argument, datediff() use TextConverter to convert a string to date, > for '2012-01-01' we will get 2012-01-01T08:00:00.000+0800 > now, datediff() will return a number less than the real date diff > we should use TextConverter to deal with date input too. > reproduce: > {code:java} > select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177 > select datediff('2017-12-18', '2012-01-01'); --2178 > {code} |


我也遇到了同样的问题
感谢作者,不知道现在有没有反馈了
hive> select datediff(to_date(current_date()), to_date(‘2019-02-14’));
OK
35
hive> select datediff(current_date(), to_date(‘2019-02-14’));
OK
34
我也遇到了同样的问题
感谢作者,不知道现在有没有反馈了
hive> select datediff(to_date(current_date()), to_date(‘2019-02-14’));
OK
35
hive> select datediff(current_date(), to_date(‘2019-02-14’));
OK
34
已经看到hive的bug 解决了,感谢作者