Search code examples
regexdatabasehivetimestamphiveddl

Parse timestamp in Hive during table creation


I have a file that looks like this:

33.49.147.163           20140416123526  https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en    29  409 Firefox/5.0

I want to load it into a hive table. I do it this way:

create external table Logs (
    ip string,
    ts timestamp,
    request string,
    page_size smallint,
    status_code smallint,
    info string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"timestamp.formats" = "yyyyMMddHHmmss",
"input.regex" = '^(\\S*)\\t{3}(\\d{14})\\t(\\S*)\\t(\\S*)\\t(\\S*)\\t(\\S*).*$'
)
stored as textfile
location '/data/user_logs/user_logs_M';

And

select * from Logs limit 10;

results in

33.49.147.16 NULL https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0

How to parse timestamps correctly, to avoid this NULLs?


Solution

  • "timestamp.formats" SerDe property works only with LazySimpleSerDe (STORED AS TEXTFILE), it does not work with RegexSerDe. If you are using RegexSerDe, then parse timestamp in a query.

    Define ts column as STRING data type in CREATE TABLE and in the query transform it like this:

    select timestamp(regexp_replace(ts,'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})','$1-$2-$3 $4:$5:$6.0')) as ts
    

    Of course, you can extract each part of the timestamp using SerDe as separate columns and properly concatenate them with delimiters in the query to get correct timestamp format, but it will not give you any improvement because anyway you will need additional transformation in the query.