Search code examples
pythonpython-2.7csvtextexcel-2013

CSV Writer truncates characters in sequence in Excel 2013


I have an interesting situation with Python's csv module. I have a function that takes specific lines from a text file and writes them to csv file:

import os
import csv

def csv_save_use(textfile, csvfile):
    with open(textfile, "rb") as text:
        for line in text:
            line=line.strip()
            with open(csvfile, "ab") as f:            
                if line.startswith("# Online_Resource"):   
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])

                if line.startswith("##"):
                    write = csv.writer(f, dialect='excel',
                                       delimiter='\t',
                                       lineterminator="\t",
                                       )
                    write.writerow([line.lstrip("# ")])

Here is a sample of some strings from the original text file:

# Online_Resource: https://www.ncdc.noaa.gov/
## Corg%    percent organic carbon,,,%,,paleoceanography,,,N

What is really bizarre is the final csv file looks good, except the characters in the first column only (those with the # originally) partially "overwrite" each other when I try to manually delete some characters from the cell:

Result

Oddly enough, too, there seems to be no formula to how the characters get jumbled each time I try to delete some after running the script. I tried encoding the csv file as unicode to no avail.

Thanks.


Solution

  • You've selected excel dialect but you overrode it with weird parameters:

    • You're using TAB as separator and line terminator, which creates a 1-line CSV file. Close enough to "truncated" to me
    • Also quotechar shouldn't be a space.

    This conveyed a nice side-effect as you noted: the csv module actually splits the lines according to commas!

    The code is inefficient and error-prone: you're opening the file in append mode in the loop and create a new csv writer each time. Better done outside the loop.

    Also, comma split must be done by hand now. So even better: use csv module to read the file as well. My fix proposal for your routine:

    import os
    import csv
    
    def csv_save_use(textfile, csvfile):
      with open(textfile, "rU") as text, open(csvfile, "wb") as f:
        write = csv.writer(f, dialect='excel',
                           delimiter='\t')
        reader = csv.reader(text, delimiter=",")
        for row in reader:
            if not row:
                continue  # skip possible empty rows            
            if row[0].startswith("# Online_Resource"):
                write.writerow([row[0].lstrip("# ")])
    
            elif row[0].startswith("##"):
                write.writerow([row[0].lstrip("# ")]+row[1:]) # write row, stripping the first item from hashes
    

    Note that the file isn't properly displayed in excel unless to remove delimiter='\t (reverts back to default comma)

    Also note that you need to replace open(csvfile, "wb") as f by open(csvfile, "w",newline='') as f for Python 3.

    here's how the output looks now (note that the empty cells are because there are several commas in a row)

    enter image description here