Search code examples
csvlinefeedamazon-redshift-spectrum

Redshift External Table not handling Linefeed character within a field


I have an external table using Glue catalog and reading a CSV file. The fields are enclosed in double quotes if they have comma or a LF (LineFeed). I am able to read a field properly as a single value if there is delimiter within that field but the fields having LineFeed in it are getting split and the rest of the columns afterwards are shown as NULL.

Have used serde row format to specify the quote character. and used normal row format delimiter with line delimited by Line Feed ascii as well. But as of now, none of it seems to be working.

CREATE EXTERNAL TABLE schema.ext_table
(
id varchar (18),
name varchar (80)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\' )
STORED AS textfile
LOCATION 's3://path/'
TABLE PROPERTIES ('skip.header.line.count'='1')
;

For a file like this:

id,name,addCRLF
1,abc,defCRLF
2,"a,b",mnoCRLF
3,"aLF
b",xyzCRLF

Please note that the CRLF and LF, in the above file, can be seen by notepad++ like tools.

I want the output to be like:

1   abc  def
2   a,b  mno
3   a    xyz
    b-------this b need to be in the same cell as that of the a above

BUT the Output is coming like :

1    abc  def
2    a,b  mno
3    a    null
null null null

Solution

  • Got the official response from AWS support. Redshift Spectrum doesn't support embedded line breaks in a csv file.