Handle invalid encoding sequences in CSV with polars

Consider the following snippet:

from io import TextIOWrapper, BytesIO
import polars as pl
import pandas as pd

csv_str = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)
content = BytesIO(csv_str)
wrapped = TextIOWrapper(content, errors="replace")

try:
    df = pl.read_csv(wrapped)
except Exception as e:
    print("polars failed!")
    print(e)

wrapped.seek(0)

try:
    df = pd.read_csv(wrapped, sep=",")
except Exception as e:
    print("pandas failed!")
    print(e)

What you have there is about as invalid as a CSV gets: a single file mixing two different encodings. Strangely enough, this remains a real-life problem, and an all too frequent one.

With pandas, you can handle this either through the TextIOWrapper (with errors="replace") or with the built-in encoding_errors argument of read_csv.
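
For reference, a minimal sketch of the encoding_errors route, assuming pandas 1.3 or later (where that argument is available). The name csv_bytes is only illustrative; the payload is the same as in the snippet above, passed to pandas as raw bytes without any TextIOWrapper.

```python
from io import BytesIO

import pandas as pd

# Same mixed-encoding payload as in the snippet above (illustrative name).
csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)

# Let pandas decode the raw bytes itself; bytes that are not valid UTF-8
# are replaced with U+FFFD instead of raising UnicodeDecodeError.
df_pd = pd.read_csv(
    BytesIO(csv_bytes),
    encoding="utf8",
    encoding_errors="replace",
)
print(df_pd)
```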

Questions:

Why does this not work with polars, given that the TextIOWrapper should handle this input as a stream?
Is there a way to handle this natively with polars (that is, any way other than reading it with pandas and then converting it with polars.from_pandas)?

Solution

Use pl.read_csv(wrapped, encoding='utf8-lossy').

From help(pl.read_csv):

encoding : {'utf8', 'utf8-lossy', ...}
    Lossy means that invalid utf8 values are replaced with `�`
    characters. When using other encodings than `utf8` or
    `utf8-lossy`, the input is first decoded in memory with
    python. Defaults to `utf8`.
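
To illustrate what "lossy" replacement means for this particular input, here is a small sketch using only Python's own decoder. It is not how polars implements utf8-lossy internally; it merely shows the same kind of substitution. The names csv_bytes, strict_ok and lossy_text are illustrative.

```python
# Same mixed-encoding payload as in the question (illustrative name).
csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)

# Strict UTF-8 decoding fails: cp1252 encodes "œ" as the single byte 0x9C,
# which is not a valid UTF-8 sequence on its own.
try:
    csv_bytes.decode("utf8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", the invalid byte becomes U+FFFD and the rest
# of the data decodes normally.
lossy_text = csv_bytes.decode("utf8", errors="replace")
print(strict_ok)
print(lossy_text)
```

The original character is lost for good: once replaced, there is no way to tell from the result that the byte was meant to be "œ".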

Improved script:

from io import TextIOWrapper, BytesIO
import polars as pl
import pandas as pd

csv_str = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)
content = BytesIO(csv_str)
wrapped = TextIOWrapper(content, errors="replace")

try:
    dfpl = pl.read_csv(wrapped, encoding='utf8-lossy')
    print(dfpl)
except Exception as e:
    print("polars failed!")
    print(e)

wrapped.seek(0)

try:
    dfpd = pd.read_csv(wrapped, sep=",")
    print(dfpd)
except Exception as e:
    print("pandas failed!")
    print(e)

Output (polars first, then pandas):

shape: (2, 2)
┌──────┬──────┐
│ spam ┆ egg  │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ spam ┆ �uf  │
│ spam ┆ αυγό │
└──────┴──────┘
   spam   egg
0  spam   �uf
1  spam  αυγό