python, character-encoding, python-polars

Handle invalid encoding sequences in CSV with polars


Consider the following snippet:

from io import TextIOWrapper, BytesIO
import polars as pl
import pandas as pd

csv_str = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)
content = BytesIO(csv_str)
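# Set the encoding explicitly so the result does not depend on the locale default.
wrapped = TextIOWrapper(content, encoding="utf8", errors="replace")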

try:
    df = pl.read_csv(wrapped)
except Exception as e:
    print("polars failed!")
    print(e)

wrapped.seek(0)

try:
    df = pd.read_csv(wrapped, sep=",")
except Exception as e:
    print("pandas failed!")
    print(e)

This is about as bad as an invalid CSV gets: a single file that mixes two different encodings (cp1252 and UTF-8). Strangely enough, this remains a real-life problem, and an all too frequent one.
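
To make the problem concrete, here is a minimal stdlib-only sketch that rebuilds the same payload as the snippet above and shows where a strict UTF-8 decode gives up. The names `csv_bytes`, `bad_offset` and `bad_byte` are only illustrative.

```python
# Rebuild the same mixed-encoding payload as in the snippet above.
csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)

bad_offset = None
bad_byte = None
try:
    # A strict decode stops at the first byte sequence that is not valid UTF-8.
    csv_bytes.decode("utf8")
except UnicodeDecodeError as exc:
    bad_offset = exc.start
    bad_byte = csv_bytes[exc.start:exc.end]
    print(f"invalid byte {bad_byte!r} at offset {bad_offset}: {exc.reason}")
```

The offending byte is the cp1252 encoding of "œ", which is not a valid UTF-8 start byte. The Greek line is fine because it really is UTF-8.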

With pandas, you can handle this either through the TextIOWrapper (with errors="replace") or through the built-in encoding_errors argument of read_csv.
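
To see what the TextIOWrapper route does on its own, independently of any dataframe library, here is a small sketch that reads the wrapped stream with the stdlib csv module. The encoding is passed explicitly here (an addition to the original snippet) so that the result does not depend on the locale default.

```python
import csv
from io import BytesIO, TextIOWrapper

csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)

# errors="replace" turns every undecodable byte into U+FFFD while streaming;
# newline="" is what the csv module recommends for file objects.
wrapped = TextIOWrapper(
    BytesIO(csv_bytes), encoding="utf8", errors="replace", newline=""
)
rows = list(csv.reader(wrapped))
print(rows)
```

Any consumer that reads text from the wrapper sees already-repaired strings. With pandas, the built-in route should be equivalent to passing encoding="utf8" and encoding_errors="replace" to pd.read_csv on the raw bytes stream.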

Questions:

  • Why does this not work with polars, considering that the TextIOWrapper should handle this input as a stream?
  • Is there a way to handle this natively with polars (that is, any way other than reading it with pandas and then converting it with polars.from_pandas)?

Solution

  • Use pl.read_csv(wrapped, encoding='utf8-lossy'), which replaces invalid UTF-8 sequences instead of raising. The relevant part of the docstring:

    help(pl.read_csv)

    encoding : {'utf8', 'utf8-lossy', ...}
        Lossy means that invalid utf8 values are replaced with `�`
        characters. When using other encodings than `utf8` or
        `utf8-lossy`, the input is first decoded in memory with
        python. Defaults to `utf8`.
    

    Improved script:

    from io import TextIOWrapper, BytesIO
    import polars as pl
    import pandas as pd
    
    csv_str = (
        b"spam,egg\n"
        + "spam,œuf\n".encode("cp1252")
        + "spam,αυγό\n".encode("utf8")
    )
    content = BytesIO(csv_str)
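    # Set the encoding explicitly so the result does not depend on the locale default.
    wrapped = TextIOWrapper(content, encoding="utf8", errors="replace")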
    
    try:
        dfpl = pl.read_csv(wrapped, encoding='utf8-lossy')
        print(dfpl)
    except Exception as e:
        print("polars failed!")
        print(e)
    
    wrapped.seek(0)
    
    try:
        dfpd = pd.read_csv(wrapped, sep=",")
        print(dfpd)
    except Exception as e:
        print("pandas failed!")
        print(e)
    
    Output (polars first, then pandas):

    shape: (2, 2)
    ┌──────┬──────┐
    │ spam ┆ egg  │
    │ ---  ┆ ---  │
    │ str  ┆ str  │
    ╞══════╪══════╡
    │ spam ┆ �uf  │
    │ spam ┆ αυγό │
    └──────┴──────┘
       spam   egg
    0  spam   �uf
    1  spam  αυγό
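
Note that the lossy route replaces the undecodable byte, so "œuf" comes back as "�uf". If the second encoding is known, a per-line fallback can recover the original character. This is a different technique from the one in the answer. It is only a sketch: it assumes cp1252 as the fallback because the sample was built that way, and `decode_line` and `repair_csv_bytes` are hypothetical helper names.

```python
def decode_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode one line as UTF-8, falling back to `fallback` if that fails."""
    try:
        return raw.decode("utf8")
    except UnicodeDecodeError:
        # Bytes that are undefined in the fallback code page become U+FFFD.
        return raw.decode(fallback, errors="replace")


def repair_csv_bytes(data: bytes, fallback: str = "cp1252") -> bytes:
    """Re-encode a mixed-encoding payload as clean UTF-8, line by line."""
    lines = data.splitlines(keepends=True)
    return "".join(decode_line(line, fallback) for line in lines).encode("utf8")


csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)
repaired = repair_csv_bytes(csv_bytes)
print(repaired.decode("utf8"))
```

The repaired bytes are valid UTF-8, so they could be passed to pl.read_csv (for example wrapped in a BytesIO) with the default encoding. The heuristic is not safe in general: a cp1252 line that happens to be valid UTF-8 is decoded as UTF-8, and splitting on line breaks ignores newlines inside quoted fields.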