I have a function where the arguments passed are 5 filepaths. However, the first path is to a csv.gz where there seems to be an undefined character inside of the file. How can I work around this?
I'm using Python version 3.11.1. Code and error message shown below.
function(r"filepath1", r"filepath2", r"filepath3", r"filepath4", r"filepath5")
Error Message:
Cell In[3], line 8, in function(filepath1, filepath2, filepath3, filepath4, filepath5)
6 file1DateMap = {}
7 infd = open(file1path1, 'r')
8 infd.readline()
9 for line in infd:
10 tokens = line.strip().split(',')
File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined
I tried
file = open(filename, encoding="utf8")
but encoding was undefined in my version of Python.
I tried the "with open" method
file2 = r"file2path"
file3 = r"file3path"
file4 = r"file4path"
file5 = r"file5path"
file1name = r"file1path"
with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)
but the function was expecting a string:
TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper
I am expecting the function to run and write the processed output to folders on my desktop.
UPDATE
I checked the encoding of the file in Visual Studio Code, which reported UTF-8. I then wrote the following code:
with open(r"path1", encoding="utf8") as openfile1:
    file1 = openfile1.read()
Received this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
UPDATE 2
Checked encoding with this code
with open(r"filepath1") as f:
    print(f)
encoding='cp1252'
However now when I pass the new encoding argument:
with open(r"path1", encoding="cp1252") as openfile1:
    file1 = openfile1.read()
I am back to square 1 with the following error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined
UPDATE 3
Gzip worked. I used the following code:
import gzip
with gzip.open(r"path1", mode="rb") as openfile1:
    file1 = openfile1.read()
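For context on why gzip was the fix: the 0x8b byte at position 1 in the UTF-8 error above is consistent with the gzip magic number (every gzip stream starts with the bytes 0x1f 0x8b), which suggests open() was reading the compressed bytes rather than the CSV text. Here is a minimal sketch of a check for that; the helper name looks_like_gzip and the sample file names are made up for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


# Demo on throwaway files (hypothetical names, not the files from the question)
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.csv.gz")
txt_path = os.path.join(tmp_dir, "sample.csv")

with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
    fh.write("a,b\n1,2\n")
with open(txt_path, "w", encoding="utf-8") as fh:
    fh.write("a,b\n1,2\n")

print(looks_like_gzip(gz_path))   # compressed file
print(looks_like_gzip(txt_path))  # plain text file
```

If the check passes, the file needs to go through gzip.open (or be decompressed first) before any text encoding is applied.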
If you have a CSV file compressed into a gzip file, you should be able to read the gzip file as simply as:
with gzip.open("input.csv.gz", "rt", newline="", encoding="utf-8") as f:
I believe you'll want "rt" to read it as text (and not "rb", which will return non-decoded bytes); and of course pick the actual encoding of the file (I always use utf-8 for my examples).
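To make the difference between the two modes concrete, here is a small sketch that reads the same gzip file both ways. The helper read_both_modes and the demo file are invented for this example, and utf-8 is only an assumption about the encoding.

```python
import gzip
import os
import tempfile


def read_both_modes(path, encoding="utf-8"):
    """Return (raw_bytes, decoded_text) for the gzip file at `path`."""
    # "rb": decompressed but still undecoded bytes
    with gzip.open(path, "rb") as fb:
        raw = fb.read()
    # "rt": decompressed and decoded to str using `encoding`
    with gzip.open(path, "rt", encoding=encoding, newline="") as ft:
        text = ft.read()
    return raw, text


# Build a throwaway gzip file to read back (hypothetical content)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as f:
    f.write("date,value\r\n2023-01-01,caf\u00e9\r\n")

raw, text = read_both_modes(demo_path)
print(type(raw), type(text))
print(raw.decode("utf-8") == text)
```

With "rb" you would still have to call .decode() yourself, so "rt" with an explicit encoding is usually the more convenient choice for line-by-line CSV work.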
To further decode the CSV in the text file f, I recommend using the standard library's csv module:
...
    reader = csv.reader(f)
    for row in reader:
        print(row)
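Putting the pieces together, here is a minimal end-to-end sketch. The function name read_gz_csv_rows, the skip_header flag and the sample data are assumptions for illustration, not the asker's actual function. It keeps taking a path string and opens the file itself, which also sidesteps the TypeError from passing an already opened file object. The sample row contains a right curly quote, whose UTF-8 encoding (E2 80 9D) includes a 0x9d byte; that is one plausible source of the cp1252 error, though I can't confirm it for your file.

```python
import csv
import gzip
import os
import tempfile


def read_gz_csv_rows(path, encoding="utf-8", skip_header=True):
    """Read a gzip-compressed CSV from `path` and return its rows as lists."""
    rows = []
    with gzip.open(path, "rt", newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)  # drop the header row, like infd.readline()
        for row in reader:
            rows.append(row)
    return rows


# Throwaway sample file (hypothetical data)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "input.csv.gz")
with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "note"])
    # The comma inside the second field gets quoted by csv.writer
    writer.writerow(["2023-01-01", "said \u201chello\u201d, then left"])

rows = read_gz_csv_rows(sample_path)
print(rows)
```

Compared with line.strip().split(','), csv.reader also handles quoted fields that contain commas, as the sample row shows.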