Search code examples
sqlhiveprestoamazon-athenaexternal-tables

Create external table by adding two columns present in CSV file in Hive/Athena


I have a CSV file having three columns A, B, C.

I would like to create an external table on top of it having two columns A, (B + C).

Is it possible to make it happen during the table creation itself or should I create a view once the table is created?


Solution

  • It is not possible using CSVSerDe or LasySimpleSerDe but possible using RegexSerDe. Each column in table definition should have corresponding capturing group () in input.regex.

    For example if file is comma separated, table can be defined like this:

     CREATE EXTERNAL TABLE mytable(
     colA string COMMENT '', 
     colBC string COMMENT '')
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES ( 
      'input.regex'= '^"?(.*?)"?,(.*)$'
    LOCATION ...
    

    input.regex means:

    1. ^ - beginning of the string anchor
    2. "? - optional quote
    3. (.*?) - capturing group for 1st column - any character any times not greedy
    4. "? - optional quote
    5. , - comma
    6. (.*) - capturing group for 2nd column (this will capture everything after first comma till the end)
    7. $ - end of the string anchor

    You can write regex that will capture your data correctly. You can debug regex without creating table, use regexp_replace :

    select regexp_replace('"A",B,C', --data string example
                          '^"?(.*?)"?,(.*)$', --regex with 2 capturing groups for 2 columns
                         '$1 | $2'); --pipe delimited columns should be returned