I have an external table using Glue catalog and reading a CSV file. The fields are enclosed in double quotes if they have comma or a LF (LineFeed). I am able to read a field properly as a single value if there is delimiter within that field but the fields having LineFeed in it are getting split and the rest of the columns afterwards are shown as NULL.
Have used serde row format to specify the quote character. and used normal row format delimiter with line delimited by Line Feed ascii as well. But as of now, none of it seems to be working.
CREATE EXTERNAL TABLE schema.ext_table
(
id varchar (18),
name varchar (80)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\' )
STORED AS textfile
LOCATION 's3://path/'
TABLE PROPERTIES ('skip.header.line.count'='1')
;
For a file like this:
id,name,addCRLF
1,abc,defCRLF
2,"a,b",mnoCRLF
3,"aLF
b",xyzCRLF
Please note that the CRLF and LF, in the above file, can be seen by notepad++ like tools.
I want the output to be like:
1 abc def
2 a,b mno
3 a xyz
b-------this b need to be in the same cell as that of the a above
BUT the Output is coming like :
1 abc def
2 a,b mno
3 a null
null null null
Got the official response from AWS support. Redshift Spectrum doesn't support embedded line breaks in a csv file.