Search code examples
databasecsvhadoophiverdbms

Error loading CSV to table


My CSV file contains details about employees of company. One column contains salary of employees (eg- $4,000). So when I write a script to load data into a table by ',' delimited it takes my salary column as 4 & 000 seperately. How to deal with that ?

CSV File example - Australia,35-39,Consulting,Start Up (1-25),Web Application Developer,"$10,001 - $25,000",Enterprise

Table code-

create table survey ( country string, age string, industryType string, companyType string, occupation string, salary string, project string) row format delimited fields terminated by ',' ;


Solution

  • References:

    Conform to Apache's DDL syntax for Create/Drop/Truncate Table, and, more specifically, to Row Formats & SerDe, in order to change the FIELDS/LINES/etc properties, you can use:

    • the native SerDe (if ROW FORMAT is not specified, or ROW FORMAT DELIMITED is set), or
    • a custom SerDe (by applying the SERDE clause).

    Option 1: Using the native Serde

    CREATE TABLE survey(country string, age string, industryType string, companyType string, occupation string, salary string, project string)
    ROW FORMAT DELIMITED 
        FIELDS 
            TERMINATED BY "," 
            ESCAPED BY "\\"
        LINES TERMINATED BY "\n"
    STORED AS TEXTFILE;
    

    Notice the lack of a clause which could specify the character to quote the field values.

    Option 2: Using the custom Serde

    See the record named "CSV/TSV" in the table from Row Formats & SerDe.

    CREATE TABLE survey(country string, age string, industryType string, companyType string, occupation string, salary string, project string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
        "separatorChar" = ",",
        "quoteChar"     = "\"",
        "escapeChar"    = "\\"
    )     
    STORED AS TEXTFILE;
    

    I suggest you to make a test with a CSV file containing one line of values. Inside the line enclose all column values into double quotes and see what happens. Beeing enclosed, the "," character (like the one in salary field) will/should be automatically escaped and, therefore, saved as part of the corresponding column value.

    Good luck.