Search code examples
sqlregexhivehiveqlhive-serde

Handling Data with and Without double quotation marks In Hive


Can someone please guide me how should I Load data in hive where I am getting " in some rows and in some rows data is coming without " for the same column value.

    Sample Data:

    id,name,desc,uqc,roll,age
    1,Monali,"abhc,jkjk",,23,23
    2,mj,nhiijkla,67,23,60
    7,jena,"kdjuu,hsysi,juw",3,34,23
    1,Monali,"/"coppers bars","rods and profiles"/",,23,23
    2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60

In above data for id '2' " are not there in desc column value.

My Create Statement:

    create external table testing(id int, 
                  name string, 
                  desc string, 
                  uqc double, 
                  roll int, 
                  age int
                 ) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES ('input.regex'='^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
    location ....
    TBLPROPERTIES("skip.header.line.count"="1")
    ;

While loading Data I'm not getting any error. But when I doing select * from testing.select statement is not Executing. Above Create and select statement working fine If data is coming with ", but is not working if data is coming with and without ".


Solution

  • Currently third group in the regexp is enclosed in quotes (quotes are mandatory). Try to make quotes optional "? - means zero or one quote, also make group content non-greedy (.*?), so it will not catch extra quote inside the group:

    'input.regex'='^(\\d+?),(.*?),"?(.*?)"?,([0-9.]*),(\\d*),(\\d*).*' 
    

    Testing your data example using regexp_replace, I also added optional slash around 3rd group to remove it from the output:

    with mytable as (
    select stack(6,
        '1,Monali,"abhc,jkjk",,23,23',
        '2,mj,nhiijkla,67,23,60',
        '7,jena,"kdjuu,hsysi,juw",3,34,23',
        '1,Monali,"/"coppers bars","rods and profiles"/",,23,23',
        '2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60',
        '2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60'
    ) as initial_data
    )
    
    select regexp_replace(initial_data,'^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*',
                                       '$1 || $2 || $3 || $4 || $5 || $6'
                         ) as parsed_result
     from mytable
    

    Result (delimited by two pipes and spaces ' || '):

    parsed_result
    1 || Monali || abhc,jkjk || || 23 || 23
    2 || mj || nhiijkla || 67 || 23 || 60
    7 || jena || kdjuu,hsysi,juw || 3 || 34 || 23
    1 || Monali || "coppers bars","rods and profiles" || || 23 || 23
    2 || money || "COUPLING","FLANGES & CROSS OVER" || 67 || 23 || 60
    2 || money || 17",19"LCD PANEL FOR COMPUTER || 67 || 23 || 60
    

    So, if result looks good, use this regexp in the table DDL:

    'input.regex'='^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*'
    

    Carefully test it on the whole dataset and check for empty/null values, fix regexp if necessary.