
Import data from .avro files into a Hive table


I created a Hive table with the following command and an Avro schema I had.

CREATE TABLE table_name
PARTITIONED BY (t string, y string, m string, d string, h string, hh string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://location/schema.avsc');

Now I want to load the data I have in HDFS into the created table.

I have an HDFS location where the data sits in a directory structure like t/y/m/d/h/hh/data.avro. There are multiple directories following this pattern because those are my partition columns.

I want to load all of that data into the created table.

I tried using an external table, but it throws exceptions.


Solution

  • If you follow the Hive partition-naming convention in your HDFS folders and create the Hive table pointing to that location, you only need to run the MSCK REPAIR TABLE statement.

    For example

    CREATE TABLE table_name
    PARTITIONED BY (t string, y string, m string, d string, h string, hh string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION 'hdfs:///location/data'
    TBLPROPERTIES (
    'avro.schema.url'='hdfs://location/schema/schema.avsc');
    

    and lay the data out on HDFS like

    /location/data/y=2016/m=02/d=03/h=03/hh=12/data.avro
    /location/data/y=2016/m=02/d=03/h=03/hh=13/data2.avro

    That way you will be able to load the data with the following statement, because Hive will recognize the partition directories:

    msck repair table table_name;
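
    Once the repair has run, a quick sanity check (just a sketch; table_name is the placeholder used above) is to list the partitions Hive now knows about and query a few rows:

    SHOW PARTITIONS table_name;
    SELECT * FROM table_name LIMIT 10;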
    

    If you don't want to do this, you can add the partitions manually with ALTER TABLE ... ADD PARTITION, like

    ALTER TABLE table_name ADD PARTITION (y='2016', m='02', d='03', h='03', hh='12')
    LOCATION '/yourpath/y=2016/m=02/d=03/h=03/hh=12/'; (or any other folder)
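
    If you have many partition directories, Hive also lets you register several partitions in a single ALTER TABLE ... ADD PARTITION statement. A sketch, using hypothetical paths and a hypothetical t='events' value; note that every partition column declared in the table above (including t) must appear in each partition spec:

    ALTER TABLE table_name ADD IF NOT EXISTS
    PARTITION (t='events', y='2016', m='02', d='03', h='03', hh='12') LOCATION '/yourpath/y=2016/m=02/d=03/h=03/hh=12/'
    PARTITION (t='events', y='2016', m='02', d='03', h='03', hh='13') LOCATION '/yourpath/y=2016/m=02/d=03/h=03/hh=13/';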