Tags: python, csv, error-handling, unicodedecode

Opening files (csv.gz) containing undefined characters and passing files into function


I have a function that takes five file paths as arguments. The first path points to a csv.gz file, and reading it fails because the file appears to contain an undefined character. How can I work around this?

I'm using Python version 3.11.1. Code and error message shown below.

function(r"filepath1", r"filepath2", r"filepath3", r"filepath4", r"filepath5")

Error Message:

Cell In[3], line 8, in function(filepath1, filepath2, filepath3, filepath4, filepath5)
 6 file1DateMap = {}
 7 infd = open(filepath1, 'r')
 8 infd.readline()
 9 for line in infd:
10     tokens = line.strip().split(',')
 
File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined

I tried

file = open(filename, encoding="utf8")

but encoding was undefined in my version of Python.

I tried the "with open" method

file2 = r"file2path"
file3 = r"file3path"
file4 = r"file4path"
file5 = r"file5path"
file1name = r"file1path"
with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)

but the function was expecting a string:

TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper

I am expecting the function to run and write the processed output to folders on my desktop.

UPDATE

I checked the file's encoding in Visual Studio Code, which reported UTF-8. I then wrote the following code:

with open(r"path1", encoding="utf8") as openfile1:
    file1 = openfile1.read()

Received this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

UPDATE 2

I checked which encoding Python was using with this code:

with open(r"filepath1") as f:
    print(f)

The output included encoding='cp1252'.

However, when I now pass that encoding explicitly:

with open(r"path1", encoding="cp1252") as openfile1:
    file1 = openfile1.read()

I am back to square 1 with the following error message:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined

UPDATE 3

Gzip worked. I used the following code:

import gzip
with gzip.open(r"path1", mode="rb") as openfile1:
    file1 = openfile1.read()
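
Note that mode="rb" returns the decompressed content as bytes, which still has to be decoded before it can be split as text. The sketch below contrasts "rb" with "rt" on a small throwaway sample file (the sample data and the helper name looks_like_gzip are made up for illustration, and UTF-8 is an assumed encoding). It also checks for the gzip magic number 0x1f 0x8b; its second byte is the 0x8b that the earlier UTF-8 error reported at position 1.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


# Build a small throwaway sample so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8", newline="") as f:
    f.write("id,name\n1,caf\u00e9\n")

# "rb": decompressed, but still raw bytes.
with gzip.open(sample_path, "rb") as f:
    raw = f.read()

# "rt": decompressed and decoded to str with the given encoding (assumed UTF-8).
with gzip.open(sample_path, "rt", encoding="utf-8", newline="") as f:
    text = f.read()

print(looks_like_gzip(sample_path))              # True
print(type(raw).__name__, type(text).__name__)   # bytes str
print(raw.decode("utf-8") == text)               # True
```

If the real file is not UTF-8, the decode step would need whatever encoding the data was actually written in.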

Solution

  • If you have a CSV file compressed with gzip, you should be able to open it as simply as:

    with gzip.open("input.csv.gz", "rt", newline="", encoding="utf-8") as f:
    

    I believe you'll want "rt" to read the file as text (not "rb", which returns undecoded bytes). Pass the file's actual encoding; I use utf-8 here only because I always use it in my examples.

    To parse the CSV rows from the text file f, I recommend using the standard library's csv module:

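    import csv
    import gzip

    with gzip.open("input.csv.gz", "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)  # yields each row as a list of strings
        for row in reader:
            print(row)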
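
The function in the question opens its arguments itself, so it expects path strings (hence the TypeError when it was handed an open file object). If the function cannot be changed to call gzip.open, one possible workaround is to decompress the .gz into a temporary plain CSV and pass that path instead. This is only a sketch: the helper name decompress_to_temp_csv is hypothetical, and it assumes the decompressed file fits comfortably on disk.

```python
import gzip
import shutil
import tempfile


def decompress_to_temp_csv(gz_path):
    """Decompress gz_path into a temporary .csv file and return the new path.

    Hypothetical helper: the bytes are copied unchanged, so the temporary
    file keeps whatever text encoding the original data was written in.
    """
    with gzip.open(gz_path, "rb") as src:
        # delete=False keeps the file on disk after it is closed, so another
        # function can reopen it by name; the caller should remove it later.
        with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as dst:
            shutil.copyfileobj(src, dst)
            return dst.name
```

The call would then look something like function(decompress_to_temp_csv(r"filepath1"), r"filepath2", r"filepath3", r"filepath4", r"filepath5"). The function would still open the temporary file with its default encoding (cp1252 in the traceback), so if the data contains non-ASCII characters, the open() call inside the function may also need an explicit encoding argument.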