Search code examples
hivecreate-tablehive-serdehiveddl

how to separate columns in hive


I have a file:

id,name,address
001,adam,1-A102,mont vert
002,michael,57-D,costa rica

I have to create a hive table which will contain three columns : id, name and address using comma delimited but here the address column itself contains comma in between. How are we going to handle this.


Solution

  • One possible solution is using RegexSerDe:

    CREATE TABLE table my_table (
        id       string,
        name     string,
        address  string
    ) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ('input.regex'='^(.*?),(.*?),(.*?)$') 
    location 'put location here' 
    ;
    

    Replace location property with your table location and put the file(s) into that location.

    First group (.*?) will match everything before first comma, second group will match everything after first comma and before second comma and third group will match everything after second comma.

    Also add TBLPROPERTIES("skip.header.line.count"="1") if you need to skip header and it always exists in the file. If header can be absent, then you can filter header rows using where id !='id'

    Also you can easily test Regex for extracting columns even without creating table, like this:

    select regexp_replace('002,michael,57-D,costa rica','^(.*?),(.*?),(.*?)$','$1|$2|$3');
    

    Result:

    002|michael|57-D,costa rica
    

    In this example query returns three groups, separated by |. In such way you can easily test your regular expression, check if groups are defined correctly before creating the table with it.

    Answering question in the comment. You can have address with comma and one more column without comma like this:

    select regexp_replace('001,adam,1-A102, mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
    

    Returns:

    001|adam|1-A102, mont vert|sydney
    

    Checking comma is optional in Address column:

    hive> select regexp_replace('001,adam,1-A102 mont vert,sydney','^(.*?),(.*?),(.*?),([^,]*?)$','$1|$2|$3|$4');
    

    Returns:

    001|adam|1-A102 mont vert|sydney
    

    Read this article for better understanding: https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562

    [^,] means not a comma, last column can be everything except comma.

    And of course add one more column to the DDL.