Search code examples
regexcsvinformaticaeol

Handling CSV file with CRLF record terminators and unquoted multiline fields with LF terminators for Informatica Developer


I have a file I'm feeding to Informatica Developer. I know the file is badly formated, but I cannot change it.

This file is a standard CSV file, using semicolons as delimiters. The fields are not between quotes. It uses CRLF end of line characters.

The problem comes from the fact that a few fields contain LF characters.

When I use LF as the row delimiter, the fields containing LF characters break and are not imported right.

I've tried using CR as the row delimiter, but it means that the first field of every row (except the first one) contains "\n". Currently I'm deleting the first character of the first field to get rid of the "\n", but it means that the first character of the first field of the first row is deleted even though it's a legitimate character. I haven't been able to find the right regex to separate that row from the others.

Is there a solution I'm not seeing?


Solution

  • According to your explanations:

    • Lines terminates by a CRLF sequence
    • Fields can be multiline with LF lines terminators
    • It sounds like those LF are part of these fields, and therefore must be kept

    If you have PERL, you can process the lines by :

    1. setting CRLFs as line terminators also called record separators
    2. surrounding the values by double quotes "

    The line performs this task.

    perl -lpe  'BEGIN {$/ = "\r\n"} s/([^;]+)/"\1"/g' < records.csv
    

    Some explanations:

    • -l: "chomps" (remove) the record separator before processing a record line
    • -p: iterates over processed lines and printing the result (see man perlrun)
    • -e: executes a given script to process each read lines
    • about the script:
      • BEGIN {$/ = "\r\n"} is a special bloc executed at the beginning, so before the iteration; it tells the end of line is CRLF (\r\n)
      • s/([^;]+)/"\1"/g tells to surround each field (a sequence without semicolons) by double quotes ; this script is iterated by -p

    Then your CSV datas are protected, and ready to be feed.

    Example:

    • input (special characters represented):

       barf;berf\nbirf;borf\r\n
       parf;perf\npirf;porf\r\n
      
    • result:

       "barf";"berf
       birf";"borf"
       "parf";"perf
       pirf";"porf"