
How to escape only delimiter and not the newline character in CSV


I am receiving normal comma-delimited CSV files in which the data itself contains newline characters.

Input data

I want to convert the input data to:

  1. Pipe (|) delimited
  2. Without any quote characters (" or ') around fields
  3. Pipe (|) within data escaped with a caret (^) character

A single row's data may also span multiple lines (i.e., fields can contain embedded newlines).

Expected output data

The output file I was able to generate:

Output Data

As you can see in the image, the caret (^) correctly escapes every pipe (|) in the data, but it also escapes the newline characters in the 5th and 6th lines, which I don't want.

NOTE: All carriage return (\r, CR) and newline (\n, LF) characters should remain exactly as they are, just as shown in the images.

import csv
import sys

inputPath = sys.argv[1]
outputPath = sys.argv[2]
# newline='' on both files so the csv module controls the line endings
with open(inputPath, newline='', encoding="utf-8") as inputFile:
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        # QUOTE_NONE disables quoting entirely; every special character
        # (the delimiter, but also \r and \n) gets prefixed with escapechar
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|',
            quoting=csv.QUOTE_NONE, escapechar='^',
            doublequote=False, quotechar=None)
        writer.writeheader()
        writer.writerows(reader)

print("Formatting complete.")

The above code is written in Python, so help in Python would be great, but answers in other programming languages are also accepted.
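
For what it's worth, the escaping of the line breaks is not a bug in the script above but how the csv module's QUOTE_NONE mode works: with quoting disabled, the writer's only way to protect a special character, including \r and \n, is to prefix it with escapechar. A minimal sketch that reproduces the effect:

import csv
import io

# With QUOTE_NONE, embedded line breaks are escaped just like the pipes.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^')
writer.writerow(["Test Vendor | 3", "data in\nnew line"])
print(repr(buf.getvalue()))
# -> 'Test Vendor ^| 3|data in^\nnew line\r\n'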

There are more than 8 million records.

Please find below some sample data:

"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","[email protected]","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","[email protected]","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","[email protected]","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e 
data in new line.","","","","","","[email protected]","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""

NOTE: If you copy the data above, please make sure the 5th and 6th lines end with only LF (i.e., newline, \n), just as shown in the images; otherwise, try to replicate those two lines, because not escaping those two lines specifically is what this question is all about, as highlighted in the image below.
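
To make the expected result concrete: applying the three rules above, the sample row that contains a pipe should come out as a single line like

a0E6D000001FagIUAS|Test Vendor ^| 3||This Vendor contains a Pipe (^|).||||||[email protected]|2020-4-1 06:38:45|2020-4-1 06:38:45|false|

while the multi-line row keeps its embedded line break as-is, with no caret in front of it.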

The above code is the final outcome of all my findings on the internet. I've even tried the pandas library, and its final output is the same.


Solution

  • An alternative way to achieve what I want, which I've implemented using a Windows PowerShell script:

    # Escape every pipe in the data with a caret
    ((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
    # Drop the opening quote at the start of the file
    ((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
    # Drop the closing quote and trailing CRLF at the end of the file
    ((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
    # Turn quote-CRLF-quote record boundaries into bare CRLF
    ((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
    # Turn quote-comma-quote field boundaries into pipes
    ((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
    # Collapse doubled quotes back to single quotes
    ((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
    

    Ways to execute it:

    1. Using PowerShell

      replace.ps1 '< path_to_file >'

    2. Using a Batch Script

      C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"

    NOTE: PowerShell v5.0 or greater is required.

    This can process about 1 million records in a minute or so.

    What I've figured out is that we have to split bulky CSV files into multiple files of 1 million records each and then process them all separately, as sketched below.
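
    One way to do that split safely in Python (a sketch; the chunk size and output file naming are illustrative) is to let csv.reader walk the records, so rows with embedded newlines are never cut in half:

    import csv
    import sys

    # Sketch: split a large CSV into chunks of 1,000,000 records each.
    # csv.reader parses quoted fields, so multi-line rows stay intact.
    CHUNK = 1_000_000
    inputPath = sys.argv[1]

    with open(inputPath, newline='', encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)          # repeat the header in every chunk
        out, writer, count, part = None, None, CHUNK, 0
        for row in reader:
            if count >= CHUNK:         # current chunk is full; start a new file
                if out:
                    out.close()
                part += 1
                out = open(f"{inputPath}.part{part}.csv", 'w',
                           newline='', encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                count = 0
            writer.writerow(row)
            count += 1
        if out:
            out.close()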

    Please correct me if I'm wrong, or if there is any other alternative.
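
  • For completeness, a single-pass pure-Python alternative is also possible: let csv.reader parse the quoted input (it keeps newlines that sit inside a field), then assemble each output record by hand so that only the literal pipes in the data are escaped. A sketch, assuming records should end with CRLF like the PowerShell output:

    import csv
    import sys

    inputPath = sys.argv[1]
    outputPath = sys.argv[2]

    # newline='' lets csv.reader see the raw line endings and keep
    # embedded newlines inside quoted fields intact.
    with open(inputPath, newline='', encoding="utf-8") as src, \
            open(outputPath, 'w', newline='', encoding="utf-8") as dst:
        for row in csv.reader(src):
            # Escape only the pipes; embedded \r / \n pass through untouched.
            # If literal carets must be escaped too, add .replace('^', '^^') first.
            dst.write('|'.join(f.replace('|', '^|') for f in row) + '\r\n')

    Because this streams one row at a time, it should handle the 8 million records without splitting the file first.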