
Hive on AWS: converting S3 JSON to a columnar format while preserving partitions


I have files in S3 that contain many lines of newline-delimited JSON. I want to convert these files to a columnar format for consumption by AWS Athena.

I am following the Converting to Columnar Formats guide to do this; however, after conversion the partition convention in S3 is lost.

In this example, how do you preserve the dt partition in the converted Parquet S3 folder structure? When I run the example, it just outputs s3://myBucket/pq/000000_0 and not s3://myBucket/pq/dt=2009-04-14-04-05/000000_0.

Here is the HQL that sets up the interface to bring the JSON into a Hive table:

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string,
  number string,
  processId string,
  browserCookie string,
  requestEndTime string,
  timers struct<modelLookup:string, requestTime:string>,
  threadId string,
  hostname string,
  sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions';

MSCK REPAIR TABLE impressions;
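
For reference, a quick way to confirm that the repair registered the dt partitions (a sketch; the listing is abridged, and dt=2009-04-14-04-05 is the partition used below):

SHOW PARTITIONS impressions;
-- ...
-- dt=2009-04-14-04-05
-- ...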

Here is the HQL that converts the data to Parquet:

CREATE EXTERNAL TABLE parquet_hive (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';

INSERT OVERWRITE TABLE parquet_hive
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM impressions
WHERE dt='2009-04-14-04-05';

Solution

  • First, add PARTITIONED BY (dt string) to the parquet_hive definition; a sketch follows below.
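
    A minimal sketch of the revised definition (an illustration, assuming the same columns and the same s3://mybucket/pq/ location as in the question; note that dt appears only in the PARTITIONED BY clause, not in the column list):

    CREATE EXTERNAL TABLE parquet_hive (
      requestBeginTime string,
      adId string,
      impressionId string,
      referrer string,
      userAgent string,
      userCookie string,
      ip string)
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://mybucket/pq/';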

    Second, if you want to insert the data partition by partition, you have to declare the partition you are inserting into. Note the PARTITION (dt='2009-04-14-04-05'):

    INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
    SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
    FROM impressions
    WHERE dt='2009-04-14-04-05';
    

    An easier way is to use dynamic partitioning.
    Note the PARTITION (dt) and the dt as the last column in the SELECT.

    You might need to set hive.exec.dynamic.partition.mode:

    set hive.exec.dynamic.partition.mode=nonstrict;
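
    Depending on your Hive version, you may also need to enable dynamic partitioning itself (it defaults to true on recent Hive releases, so this line may be a no-op):

    set hive.exec.dynamic.partition=true;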
    

    INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
    SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip, dt
    FROM impressions
    WHERE dt='2009-04-14-04-05';
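
    After either insert, the partition appears both in the metastore and as a dt=... prefix under s3://mybucket/pq/. A quick check (a sketch):

    SHOW PARTITIONS parquet_hive;
    -- dt=2009-04-14-04-05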
    

    P.S.
    CREATE EXTERNAL TABLE impressions does not "read the JSON into a Hive table".
    It is just an interface with the necessary metadata to read the underlying files.
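
    To see the distinction (a sketch): dropping the external table removes only the metastore entry; the JSON files in S3 are untouched.

    DROP TABLE impressions;
    -- removes the table definition only; the files under the LOCATION remain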