I am receiving normal comma delimited CSV files with data having new line character.
Input data
I want to convert the input data to:
My file may also contain multiple lines on data (or data in newline in a single row).
Expected output data
Output file I was able to generate.
As you can see in the image that caret (^) perfectly escaped all pipes (|) in data, but also escaping the newline character in 5th and 6th line, which I don't want.
NOTE: All the carriage returns (\r, or CR) and newline (\n, LF) characters should be as it is just like shown in images.
import csv
import sys
inputPath = sys.argv[1]
outputPath = sys.argv[2]
with open(inputPath, encoding="utf-8") as inputFile:
with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
reader = csv.DictReader(inputFile, delimiter=',')
writer = csv.DictWriter(
outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
writer.writeheader()
writer.writerows(reader)
print("Formationg complete.")
The above code has been written in Python, it would be great if I can get help in Python. Answers in other programming languages also accepted.
There is more than 8 million records
Please find below some sample data:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","[email protected]","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","[email protected]","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","[email protected]","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","[email protected]","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy above data, please make sure that 5th and 6th line should end with only LF (i.e New Line, \n) just like shown in images, or else please try to replicate those 2 line as that's what this question is all about not escaping those 2 lines specificaly, as highlighted in the image below.
The above code is the final out come of all my findings on internet. I've even tried pandas library and it's final output is same as well.
Another alternateive to what I want to achive I've done using Wondows Powershell script.
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Execution Ways:
Using Powershell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: Powershell V5.0 or greater required
This can process 1 Million of records in a minute or so.
What I've figured out is that we have to split bulky csv files to multiplve file with 1 Million records each and then process them all seperately.
Please correct me if I'm wrong, or there's any other alternate to it.