python csv unicode python-3.9

\ufeff is appearing while reading csv using unicodecsv module


I have the following code:

import unicodecsv
CSV_PARAMS = dict(delimiter=",", quotechar='"', lineterminator='\n')
unireader = unicodecsv.reader(open('sample.csv', 'rb'), **CSV_PARAMS)
for line in unireader:
    print(line)

and it prints

['\ufeff"003', 'word one"']
['003,word two']
['003,word three']

The CSV looks like this

"003,word one"
"003,word two"
"003,word three"

I can't figure out why the first row starts with \ufeff (which I believe is a byte order mark). The first row also keeps a stray " at the beginning of its first field and is split in two at the comma, unlike the other rows.

The CSV file comes from a client, so I can't dictate how they save it. I'm looking to fix my code so that it handles the encoding.

Note: I have already tried passing encoding='utf8' in CSV_PARAMS, and it didn't solve the problem.


Solution

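  • encoding='utf-8-sig' will remove the UTF-8-encoded BOM (byte order mark, U+FEFF) that some programs write at the start of a file as a UTF-8 signature. With plain 'utf-8' the BOM is decoded as an ordinary character, so the opening " is no longer the first character of the field; the reader treats it as literal text and splits the row at the comma: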

    import unicodecsv
    
    with open('sample.csv','rb') as f:
        r = unicodecsv.reader(f, encoding='utf-8-sig')
        for line in r:
            print(line)
    

    Output:

    ['003,word one']
    ['003,word two']
    ['003,word three']
    

    But why are you using the third-party unicodecsv with Python 3? The built-in csv module handles Unicode correctly:

    import csv
    
    # Note, newline='' is a documented requirement for the csv module
    # for reading and writing CSV files.
    with open('sample.csv', encoding='utf-8-sig', newline='') as f:
        r = csv.reader(f)
        for line in r:
            print(line)
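
To see the difference between the two codecs without touching a file, here is a minimal sketch that parses the same bytes twice. The `raw` bytes are an assumption: they reproduce the three rows from the question as if saved as UTF-8 with a BOM. `read_rows` is a hypothetical helper, not part of any library.

```python
import csv
import io

# Assumed sample data: the question's three rows, preceded by the
# UTF-8 BOM bytes EF BB BF.
raw = b'\xef\xbb\xbf"003,word one"\n"003,word two"\n"003,word three"\n'


def read_rows(data, encoding):
    """Decode bytes with the given codec and parse them with csv.reader."""
    # newline='' leaves line endings untranslated, as the csv module expects.
    text = io.StringIO(data.decode(encoding), newline='')
    return list(csv.reader(text))


# Plain utf-8 keeps the BOM as the character U+FEFF in the first field.
rows_plain = read_rows(raw, 'utf-8')

# utf-8-sig drops the BOM before the csv parser sees the text.
rows_sig = read_rows(raw, 'utf-8-sig')

print(rows_plain[0])
print(rows_sig[0])
```

'utf-8-sig' also decodes input that has no BOM, so the same call should work whether or not a given client file includes the signature.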