amazon-web-services apache-spark amazon-s3 amazon-redshift amazon-redshift-spectrum

How to unload Redshift data containing newline characters into single line?

I am trying to unload redshift data and read it using spark. But I am getting more rows in unloaded data as compared to original records in Redshift table. This is happening because of new line characters in data. How can I unload the data so that I can read it safely using spark (without facing malformed records issue)

Sample data of one column -

5.7.1 [23.253.182.55      18] Our system has detected that this message is
5.7.1 likely suspicious due to the very low reputation of the sending IP
5.7.1 address. To best protect our users from spam, the message has been
5.7.1 blocked. Please visit
5.7.1  https://support.google.com/mail/answer/188131 for more information. o68-v6si2862477itg.128 - gsmtp

I tried below three combinations while unloading but no luck -

ALLOWOVERWRITE delimiter '\001' escape;

ALLOWOVERWRITE delimiter '\001' escape addquotes;

ALLOWOVERWRITE delimiter '\001' addquotes;

Solution

the unload syntax requires a select statement (an arbitrary query) as the input, not just the source table name, so you can replace newline symbols with empty strings or spaces in the query:

replace(your_text_field,'\n',' ')