My CSV file contains details about employees of company. One column contains salary of employees (eg- $4,000). So when I write a script to load data into a table by ',' delimited it takes my salary column as 4 & 000 seperately. How to deal with that ?
CSV File example - Australia,35-39,Consulting,Start Up (1-25),Web Application Developer,"$10,001 - $25,000",Enterprise
Table code-
create table survey ( country string, age string, industryType string, companyType string, occupation string, salary string, project string) row format delimited fields terminated by ',' ;
Conform to Apache's DDL syntax for Create/Drop/Truncate Table, and, more specifically, to Row Formats & SerDe, in order to change the FIELDS/LINES/etc
properties, you can use:
ROW FORMAT
is not specified, or ROW FORMAT DELIMITED
is set), or SERDE
clause).CREATE TABLE survey(country string, age string, industryType string, companyType string, occupation string, salary string, project string)
ROW FORMAT DELIMITED
FIELDS
TERMINATED BY ","
ESCAPED BY "\\"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
Notice the lack of a clause which could specify the character to quote the field values.
See the record named "CSV/TSV" in the table from Row Formats & SerDe.
CREATE TABLE survey(country string, age string, industryType string, companyType string, occupation string, salary string, project string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
I suggest you to make a test with a CSV file containing one line of values. Inside the line enclose all column values into double quotes and see what happens. Beeing enclosed, the "," character (like the one in salary
field) will/should be automatically escaped and, therefore, saved as part of the corresponding column value.
Good luck.