I have a .dat file that I had been reading with pd.read_csv, and I always needed to pass encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the file's encoding, but it still works without issue (which is great! but I don't understand why, or whether it only auto-handles certain encodings). The only options I'm using are delimiter="|" (in ParseOptions) and auto_dict_encode=True (in ConvertOptions).
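For reference, the call looks roughly like this (a minimal sketch; the file name is a placeholder):

from pyarrow import csv

# sketch of the call described above; "data.dat" is a placeholder name
table = csv.read_csv(
    "data.dat",
    parse_options=csv.ParseOptions(delimiter="|"),
    convert_options=csv.ConvertOptions(auto_dict_encode=True),
)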
How does pyarrow handle different encoding types?
pyarrow currently has no functionality to deal with different encodings, and assumes UTF8 for string/text data.
But the reason it doesn't raise an error is that pyarrow will read any non-UTF8 strings as a "binary" type column, instead of "string" type.
A small example:
# writing a small file with latin encoding
with open("test.csv", "w", encoding="latin") as f:
    f.writelines(["col1,col2\n", "u,ù"])
Reading with pyarrow gives string for the first column (which only contains ASCII characters, thus also valid UTF8), but reads the second column as binary:
>>> from pyarrow import csv
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary
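If you do need real strings, one option is to decode the binary column yourself (a minimal sketch, assuming the bytes are latin-1 encoded; the column name and index come from the example above):

import pyarrow as pa
from pyarrow import csv

table = csv.read_csv("test.csv")

# decode the raw bytes of the binary column manually, assuming latin-1
decoded = pa.array([v.as_py().decode("latin-1") for v in table["col2"]])
table = table.set_column(1, "col2", decoded)  # col2 is now a string column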
With pandas you indeed get an error by default (because pandas has no binary data type and tries to read all text columns as Python strings, thus UTF8):
>>> import pandas as pd
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
>>> pd.read_csv("test.csv", encoding="latin")
  col1 col2
0    u    ù
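Another workaround is to transcode the file to UTF8 up front, so pyarrow sees valid string data everywhere (a sketch; the output file name is a placeholder):

# rewrite the file as UTF8 before handing it to pyarrow
with open("test.csv", encoding="latin") as src, \
        open("test_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())

After that, csv.read_csv("test_utf8.csv") should give string for both columns.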