Tags: csv, pyarrow, apache-arrow

How does Pyarrow read_csv handle different file encodings?


I have a .dat file that I had been reading with pd.read_csv, and I always needed to pass encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file, but it still works without issue (which is great! but I don't understand why, or whether it only auto-handles certain encodings). The only parameters I'm using are delimiter="|" (via ParseOptions) and auto_dict_encode=True (via ConvertOptions).

How is pyarrow handling different encoding types?


Solution

  • pyarrow currently has no functionality to deal with different encodings, and assumes UTF-8 for string/text data.
    But the reason it doesn't raise an error is that pyarrow reads any non-UTF-8 strings as a "binary" type column instead of a "string" type column.

    A small example:

    # writing a small file with latin encoding 
    with open("test.csv", "w", encoding="latin") as f: 
        f.writelines(["col1,col2\n", "u,ù"])
    

    Reading with pyarrow gives string for the first column (which only contains ASCII characters, and so is also valid UTF-8), but reads the second column as binary:

    >>> from pyarrow import csv 
    >>> csv.read_csv("test.csv")
    pyarrow.Table
    col1: string
    col2: binary
    
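    If you know the file's encoding, one way to recover real strings is to decode the binary column's values yourself and swap the decoded column back into the table. A minimal sketch, assuming the latin-encoded file from the example above:

    ```python
    import pyarrow as pa
    from pyarrow import csv

    # recreate the latin-encoded file from the example above
    with open("test.csv", "w", encoding="latin") as f:
        f.writelines(["col1,col2\n", "u,ù"])

    table = csv.read_csv("test.csv")  # col2 comes back as binary

    # decode each raw bytes value with the known encoding
    decoded = pa.array(
        [v.decode("latin-1") if v is not None else None
         for v in table["col2"].to_pylist()]
    )
    # replace column 1 ("col2") with the decoded string array
    table = table.set_column(1, "col2", decoded)
    # table["col2"] is now a string column containing "ù"
    ```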

    With pandas you indeed get an error by default (because pandas has no binary data type, and will try to read all text columns as Python strings, thus UTF-8):

    >>> pd.read_csv("test.csv")
    ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
    
    >>> pd.read_csv("test.csv", encoding="latin")
    
      col1 col2
    0    u    ù