Search code examples
csvhiveopencsvamazon-athenapresto

Athena unable to parse date using OpenCSVSerde


I have a very simple csv file on S3

"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"

I'm trying to create a table across this using the following command

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

When I query the table (select * from test) I'm getting an error like this:

HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"

Some more info:

  • If I change the d column to a string the query will succeed
  • I've previously parsed dates in text files using Athena; I believe using LazySimpleSerDe
  • Definitely seems like a problem with the OpenCSVSerde

The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.


Solution

  • In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:

    [OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.

    Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.

    Based on the definition of UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation). Your dates should be the number of days that have elapsed since January 1, 1970.

    For instance, your sample CSV should look like this:

    "i","d","f","s"
    "1","17532","1.001","something great!"
    "2","17533","2.002","something terrible!"
    "3","17534","3.003","I'm an oil man"
    

    Then you can run that exact same command:

    CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
    LOCATION 's3://mybucket/test/'
    TBLPROPERTIES ("skip.header.line.count"="1");
    

    If you query your Athena table with select * from test, you will get:

      i       d          f              s           
     --- ------------ ------- --------------------- 
      1   2018-01-01   1.001   something great!     
      2   2018-01-02   2.002   something terrible!  
      3   2018-01-03   3.003   I'm an oil man    
    

    An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:

    [OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.

    It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time with the number of milliseconds that have elapsed since Midnight 1 January 1970.

    For instance, consider the following sample CSV:

    "i","d","f","s","t"
    "1","17532","1.001","something great!","1564286638027"
    "2","17533","2.002","something terrible!","1564486638027"
    "3","17534","3.003","I'm an oil man","1563486638012"
    

    And the following CREATE TABLE statement:

    CREATE EXTERNAL TABLE test (i int, d date, f  float, s string, t timestamp)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
    LOCATION 's3://mybucket/test/'
    TBLPROPERTIES ("skip.header.line.count"="1");
    

    This will be the result set for select * from test:

      i       d          f              s                       t             
     --- ------------ ------- --------------------- ------------------------- 
      1   2018-01-01   1.001   something great!      2019-07-28 04:03:58.027  
      2   2018-01-02   2.002   something terrible!   2019-07-30 11:37:18.027  
      3   2018-01-03   3.003   I'm an oil man        2019-07-18 21:50:38.012