
Removing in-field quotes in csv file


Let's say we have a comma separated file (csv) like this:

"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The "day" when earth stood still","Michael Rennie,the 'strong' man","robert wise","1951"
"the 'gladiator'","russel "the awesome" crowe","ridley scott","2000"

As you can see, lines 4 and 5 contain quotes within quoted fields. The output should look something like this:

"name of movie","starring","director","release year"
"dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
"the dark knight","christian bale, heath ledger","christopher nolan","2008"
"The day when earth stood still","Michael Rennie,the strong man","robert wise","1951"
"the gladiator","russel the awesome crowe","ridley scott","2000"

How can I get rid of such quotes (both single and double) that occur within quoted fields in a csv file? Note that a comma within a single field is okay, since the parser recognizes that it is inside quotes and treats it as one field. This is just a preprocessing step to tidy up csv files so that they can be fed into multiple parsers and converted into any format we desire. Bash, awk, or python all work. Please no perl, I'm sick of that language :D Thanks in advance!


Solution

  • How about

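    import csv
    
    def remove_quotes(s):
        # Drop every single and double quote character from a field
        return ''.join(c for c in s if c not in ('"', "'"))
    
    # Python 3: open in text mode with newline="" as the csv docs recommend
    # (on Python 2, use "rb" and "wb" instead)
    with open("fixquote.csv", newline="") as infile, open("fixed.csv", "w", newline="") as outfile:
        reader = csv.reader(infile)
        # QUOTE_ALL wraps every output field in double quotes
        writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
        for line in reader:
            writer.writerow([remove_quotes(elem) for elem in line])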
    

    which produces

    ~/coding$ cat fixed.csv 
    "name of movie","starring","director","release year"
    "dark knight rises","christian bale, anna hathaway","christopher nolan","2012"
    "the dark knight","christian bale, heath ledger","christopher nolan","2008"
    "The day when earth stood still","Michael Rennie,the strong man","robert wise","1951"
    "the gladiator","russel the awesome crowe","ridley scott","2000"
    

    BTW, you might want to check the spelling of some of those names..
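
One caveat with the csv-based approach: it relies on the reader's lenient handling of stray quotes, so a comma that sits inside an inner-quoted span (say `"a "b, c" d"`) may be treated as a field separator. As an alternative, here is a minimal sketch that skips the csv parser and splits each line on the literal `","` sequence instead. It assumes every field is wrapped in double quotes, that no field contains the sequence `","` itself, and that no field spans multiple lines; `clean_line` is a name made up for this illustration.

```python
def remove_quotes(s):
    # Drop every single and double quote character from a field
    return s.replace('"', '').replace("'", '')


def clean_line(line):
    # Assumption: the line looks like "field","field",...,"field"
    line = line.rstrip("\r\n")
    if len(line) < 2 or not (line.startswith('"') and line.endswith('"')):
        return line  # not in the expected shape, leave it untouched
    # Strip the outer quotes, then split on the "," between fields
    fields = line[1:-1].split('","')
    # Re-wrap each cleaned field in double quotes
    return '"' + '","'.join(remove_quotes(f) for f in fields) + '"'


if __name__ == "__main__":
    sample = [
        '"name of movie","starring","director","release year"',
        '"The "day" when earth stood still","Michael Rennie,the \'strong\' man","robert wise","1951"',
        '"the \'gladiator\'","russel "the awesome" crowe","ridley scott","2000"',
    ]
    for row in sample:
        print(clean_line(row))
```

To process a file, the same function can be applied to each line read from the input and the result written out with a trailing newline. The trade-off is that this version is stricter about the input shape than `csv.reader`, so it is only a fit when all fields are known to be quoted.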