Given: Hive version 2.3.0 onwards.
I have a Hive table with a fixed DDL that has been in place for a long time. Now raw data is arriving as text files with the columns in a different order, and it has to be stored in Parquet format with fixed partition criteria. My question is: how do I handle the situation when the incoming data has a different arrangement of columns?
Example:
CREATE TABLE users ( col1 string, col2 int, col3 string ... )
PARTITIONED BY (...)
STORED AS PARQUET;
and the incoming data arrangement in the text files is like:
col1  col3  col2
x     p     1
y     q     2
Notice that the column order changes.
I had a hard time finding reliable information; can anyone explain best practices for handling such a situation? If it were a small file, we could use scripts to correct the text, but if the data arrives in bulk and the text files have a different arrangement each time, what should we do? I appreciate any answer/feedback.
With changing column order and/or addition/deletion of columns, one option is to convert the text files to Parquet format before loading them into the Hive table. Set the property hive.parquet.use-column-names = true (it is false by default) so that the Parquet files are read by column name rather than by column index; this eliminates the dependency on the column order of the source files. Partitions can have different schemas, and you can create the table with the desired overall set of columns.
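As a rough illustration, here is a minimal HiveQL sketch of loading a re-ordered text file into the fixed Parquet table, assuming a hypothetical staging table users_staging, a hypothetical landing path, and a hypothetical partition column dt; the SELECT maps columns by name, so the order in the text file no longer matters:

-- Hypothetical staging table, declared in the order the current batch of text files arrives (col1, col3, col2)
CREATE EXTERNAL TABLE users_staging (
  col1 string,
  col3 string,
  col2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/incoming/users';   -- hypothetical landing path

-- The SELECT lists columns by name, so the staging column order is irrelevant
INSERT INTO TABLE users PARTITION (dt = '2018-01-01')   -- dt is a hypothetical partition column
SELECT col1, col2, col3
FROM users_staging;

If the column order differs per batch, only the staging table DDL needs to follow the file layout; the final table and the INSERT stay the same.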
Note that an external table is easier to maintain than a managed table, since you do not have to move data around when the schema changes. When the schema changes, you can drop and re-create the table and run MSCK REPAIR TABLE to register the existing partitions and read the data again.
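A sketch of that drop-and-recreate cycle, again with hypothetical table/column names, a hypothetical partition column dt, and a hypothetical warehouse path:

-- Because the table is EXTERNAL, dropping it removes only metadata; the Parquet files stay in place
DROP TABLE IF EXISTS users;

-- Re-create with the desired overall set of columns after a schema change
CREATE EXTERNAL TABLE users (
  col1 string,
  col2 int,
  col3 string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION '/warehouse/users';

-- Re-register the existing partition directories in the metastore
MSCK REPAIR TABLE users;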
To detect schema changes, you can have a process running that checks the first row of the text files (assuming it contains the column names) for any changes. The output of this process can be written to a persistent store such as MongoDB or DynamoDB with appropriate schema versioning. This helps retain a history of all the schema changes.