Search code examples
hivehqlunix-timestamp

Hive Date String Validation


I am trying to check if a string is in valid date format of ‘YYYYMMDD’ .

I am using the below technique. But for an invalid date string I am getting a valid date result.

What am I doing wrong?

SELECT'20019999',CASE WHEN unix_timestamp('20019999','YYYYMMDD')  > 0 THEN  'Good'ELSE 'Bad'END;

Solution

  • First of all, you are using the wrong format

    select  from_unixtime(unix_timestamp())                 as default_format
           ,from_unixtime(unix_timestamp(),'YYYY-MM-DD')    as wrong_format
           ,from_unixtime(unix_timestamp(),'yyyy-MM-dd')    as right_format
    ;
    

    +----------------------+---------------+---------------+
    |    default_format    | wrong_format  | right_format  |
    +----------------------+---------------+---------------+
    | 2017-10-07 04:13:26  | 2017-10-280   | 2017-10-07    |
    +----------------------+---------------+---------------+
    

    Second, there is no validation on date parts range.
    If you increase the day part by 1 it forwards you to the following day.

    with t as (select stack(7,'27','28','29','30','31','32','33') as dy)
    select  t.dy
           ,from_unixtime(unix_timestamp(concat('2017-02-',t.dy),'yyyy-MM-dd'),'yyyy-MM-dd') as dt
    
    from    t
    ;
    

    +-----+-------------+
    | dy  |     dt      |
    +-----+-------------+
    | 27  | 2017-02-27  |
    | 28  | 2017-02-28  |
    | 29  | 2017-03-01  |
    | 30  | 2017-03-02  |
    | 31  | 2017-03-03  |
    | 32  | 2017-03-04  |
    | 33  | 2017-03-05  |
    +-----+-------------+
    

    If you increase the month part by 1 it forwards you to the following month.

    with t as (select stack(5,'10','11','12','13','14') as mn)
    select  t.mn
           ,from_unixtime(unix_timestamp(concat('2017-',t.mn,'-01'),'yyyy-MM-dd'),'yyyy-MM-dd') as dt
    
    from    t
    ;
    

    +-----+-------------+
    | mn  |     dt      |
    +-----+-------------+
    | 10  | 2017-10-01  |
    | 11  | 2017-11-01  |
    | 12  | 2017-12-01  |
    | 13  | 2018-01-01  |
    | 14  | 2018-02-01  |
    +-----+-------------+
    

    Even with CAST the validation is done only on the parts ranges and not on the date itself.

    select cast('2010-02-32' as date);
    

    +-------+
    |  _c0  |
    +-------+
    | NULL  |
    +-------+
    

    select cast('2010-02-29' as date);
    

    +-------------+
    |     _c0     |
    +-------------+
    | 2010-03-01  |
    +-------------+
    

    Here is a way to achive you goal:

    with t as (select '20019999' as dt)
    select  dt  
           ,from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyyMMdd') as double_converted_dt    
    
           ,case 
                when from_unixtime(unix_timestamp(dt,'yyyyMMdd'),'yyyyMMdd')  = dt 
                then 'Good' 
                else 'Bad' 
            end             as dt_status
    
    from    t
    ;
    

    +-----------+----------------------+------------+
    |    dt     | double_converted_dt  | dt_status  |
    +-----------+----------------------+------------+
    | 20019999  | 20090607             | Bad        |
    +-----------+----------------------+------------+