Tags: hive, hiveql, avro, parquet, impala

How to achieve a change of schema in Parquet format


Just a design issue we are facing.

I have a Hive external table in Parquet format with the following columns:

describe payments_user;

col_name         data_type   comment
amount_hold      int
id               int
transaction_id   string
recipient_id     string
year             string
month            string
day              string

# Partition Information
# col_name       data_type   comment
year             string
month            string
day              string
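
For reference, the DDL behind a table like this would look roughly as follows; the LOCATION path is illustrative, not the real one.

CREATE EXTERNAL TABLE payments_user (
  amount_hold     int,
  id              int,
  transaction_id  string,
  recipient_id    string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION '/data/payments_user';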

We receive data on a daily basis and ingest it into dynamic partitions on year, month, and day. If the schema changes on the source side, say they add a new column and send the batch file, how can we ingest that data? I know Avro has this capability, but to reduce rework, how can this be achieved in Parquet format?

If Avro is the answer, what is the procedure?
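
For context, the daily ingest described above looks roughly like the following; the staging table name payments_user_staging is purely illustrative.

-- Allow dynamic partitioning for the daily load.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Load the incoming batch into the partitioned Parquet table,
-- with year/month/day resolved dynamically from the data.
INSERT INTO TABLE payments_user PARTITION (year, month, day)
SELECT amount_hold, id, transaction_id, recipient_id, year, month, day
FROM payments_user_staging;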


Solution

  • What you are looking for is schema evolution. It is supported by Hive for Parquet, with some limitations compared with Avro; a sketch of how it works is shown after the link below.

    Schema evolution in parquet format
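
    A minimal sketch of how that evolution could look on the table above, assuming the source starts sending one extra field; the column name payment_status is purely illustrative.

    -- Add the new column to the Hive table definition; existing
    -- Parquet files are left untouched.
    ALTER TABLE payments_user ADD COLUMNS (payment_status string);

    -- Parquet columns are resolved by name, so partitions written
    -- before the change simply return NULL for the new column.
    SELECT id, amount_hold, payment_status
    FROM payments_user
    LIMIT 10;

    On partitioned tables, depending on the Hive version, you may need the CASCADE keyword on ADD COLUMNS so the change also propagates to the metadata of existing partitions.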