Given: Hive version 2.3.0 onwards.
I have a Hive table with a fixed DDL that has been in place for a long time. Now raw data is arriving as text files with the columns in a different order, and it has to be stored in Parquet format with fixed partition criteria. My question is: how do I handle the situation when the incoming data has a different arrangement of columns?
Example:
CREATE TABLE users ( col1 string, col2 int, col3 string ... )
PARTITIONED BY (...)
STORED AS PARQUET;
and the incoming data arrangement in the text files is like:
col1  col3  col2
x     p     1
y     q     2
Notice that the column order changes.
I had a hard time finding reliable information; can anyone explain best practices for handling such a situation? If it were a small file, we could use scripts to correct the text, but if the data arrives in bulk and the text files have a different arrangement each time, what should we do? I appreciate any answer/feedback.
With changing column order and/or addition/deletion of columns, one option is to convert the text files to Parquet format before loading them into the Hive table. Set the property hive.parquet.use-column-names = true (it is false by default) so that the Parquet files are read by column name rather than by column index; this eliminates the dependency on the column order of the source files. Partitions can have different schemas, and you can create the table with the desired overall set of columns.
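As a rough illustration, here is a minimal HiveQL sketch of loading a re-ordered text file into the fixed Parquet table, assuming a hypothetical staging table users_staging, a hypothetical landing path, and a hypothetical partition column dt; the SELECT maps columns by name, so the order in the text file no longer matters:

-- Hypothetical staging table, declared in the order the current batch of text files arrives (col1, col3, col2)
CREATE EXTERNAL TABLE users_staging (
  col1 string,
  col3 string,
  col2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/incoming/users';   -- hypothetical landing path

-- The SELECT lists columns by name, so the staging column order is irrelevant
INSERT INTO TABLE users PARTITION (dt = '2018-01-01')   -- dt is a hypothetical partition column
SELECT col1, col2, col3
FROM users_staging;

If the column order differs per batch, only the staging table DDL needs to follow the file layout; the final table and the INSERT stay the same.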
Note that an external table is easier to maintain than a managed table, since you do not have to move data around when the schema changes. When the schema changes, you can drop and re-create the table and run MSCK REPAIR TABLE to register the existing partitions and read the data again.
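A sketch of that drop-and-recreate cycle, again with hypothetical table/column names, a hypothetical partition column dt, and a hypothetical warehouse path:

-- Because the table is EXTERNAL, dropping it removes only metadata; the Parquet files stay in place
DROP TABLE IF EXISTS users;

-- Re-create with the desired overall set of columns after a schema change
CREATE EXTERNAL TABLE users (
  col1 string,
  col2 int,
  col3 string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION '/warehouse/users';

-- Re-register the existing partition directories in the metastore
MSCK REPAIR TABLE users;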
To detect schema changes, you can have a process running that checks the first row of the text files (assuming it contains the column names) for any changes. The output of this process can be written to a persistent store such as MongoDB or DynamoDB with appropriate schema versioning. This helps retain a history of all the schema changes.