I have a file I'm feeding to Informatica Developer. I know the file is badly formated, but I cannot change it.
This file is a standard CSV file, using semicolons as delimiters. The fields are not between quotes. It uses CRLF end of line characters.
The problem comes from the fact that a few fields contain LF characters.
When I use LF as the row delimiter, the fields containing LF characters break and are not imported right.
I've tried using CR as the row delimiter, but it means that the first field of every row (except the first one) contains "\n". Currently I'm deleting the first character of the first field to get rid of the "\n", but it means that the first character of the first field of the first row is deleted even though it's a legitimate character. I haven't been able to find the right regex to separate that row from the others.
Is there a solution I'm not seeing?
According to your explanations:
If you have PERL, you can process the lines by :
"
The line performs this task.
perl -lpe 'BEGIN {$/ = "\r\n"} s/([^;]+)/"\1"/g' < records.csv
Some explanations:
-l
: "chomps" (remove) the record separator before processing a record line-p
: iterates over processed lines and printing the result (see man perlrun)-e
: executes a given script to process each read linesBEGIN {$/ = "\r\n"}
is a special bloc executed at the beginning, so before the iteration; it tells the end of line is CRLF (\r\n
)s/([^;]+)/"\1"/g
tells to surround each field (a sequence without semicolons) by double quotes ; this script is iterated by -p
Then your CSV datas are protected, and ready to be feed.
Example:
input (special characters represented):
barf;berf\nbirf;borf\r\n
parf;perf\npirf;porf\r\n
result:
"barf";"berf
birf";"borf"
"parf";"perf
pirf";"porf"