Search code examples
csvhivecreate-tablehiveddl

Is there a possiblilty to match different csv files into one hive table?


I have 53 csv files, each with different column names and contents. I want to write all the data in the csv file into one big table in Hive that contains all the necessary columns. So it may happen that columns are left empty for certain files. This depends on the csv file. Unfortunately I have no idea how to deal with this problem.

Is there a possibility to match column names from csv files to the Hive table? I've only find solutions with non-dynamic columns in the csv.

Do I have to set up a table for each file and join them afterwards?

Thank you!


Solution

  • Do I have to set up a table for each file and join them afterwards?

    Yes, setup a table for each file and put file in each table location.

    Some tables can be combined. For example if you have col2, col2, col3 in one CSV file and col1, col2, col3, col4, col5 in second CSV file (common columns are in the same positions and extra columns at the end), then you can create single table with superset of columns, columns absent in first file will be selected as NULL and not NULL from second file, put both file into the same table location.

    Also you can combine different files (same columns positioned differently) into single table location (table should have as many string columns as in the widest file), then in the select you can use INPUT__FILE__NAME pseudocolumn, parse filename from it and calculate columns depending on it, for example something like this:

    select 
    case when INPUT__FILE__NAME rlike 'invoice\\.csv' then col1 
         when INPUT__FILE__NAME rlike 'transaction\\.csv' then col3
         else NULL
     end as invoice_nbr
    

    After you created all tables, you can join them or UNION ALL + aggregate depending on the requirements.

    Much better solution is to use JSON instead of CSV in this case:

    • JSON contains names, position does not matter
    • JSON allows absence of attributes
    • If some attribute is not in the JSON record, NULL returned as column value for that record
    • you can put JSON files with different content into the same location and create a table with superset of columns, see how to create table answer 1, answer 2 and answer 3