Consider the following snippet:
from io import TextIOWrapper, BytesIO

import polars as pl
import pandas as pd

csv_str = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)
content = BytesIO(csv_str)
wrapped = TextIOWrapper(content, errors="replace")

try:
    df = pl.read_csv(wrapped)
except Exception as e:
    print("polars failed!")
    print(e)

wrapped.seek(0)

try:
    df = pd.read_csv(wrapped, sep=",")
except Exception as e:
    print("pandas failed!")
    print(e)
This is about as bad as an invalid CSV gets: it mixes two different encodings. Strangely enough, this remains a real-life problem, and an all too frequent one.
With pandas, you can handle this either through the TextIOWrapper or with the built-in encoding_errors argument.
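As a minimal sketch of the second pandas option, assuming pandas 1.3 or later (where encoding_errors is available), the raw bytes can be passed directly without a TextIOWrapper. The name csv_bytes is only illustrative.

```python
from io import BytesIO

import pandas as pd

# Same mixed-encoding payload as in the snippet above.
csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)

# encoding_errors="replace" substitutes U+FFFD for bytes that are not valid UTF-8,
# instead of raising UnicodeDecodeError.
df = pd.read_csv(BytesIO(csv_bytes), encoding="utf8", encoding_errors="replace")
print(df)
```

This should behave like wrapping the buffer in a TextIOWrapper with errors="replace", but it keeps the decoding policy in the read_csv call itself.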
Questions:
Why does this fail with polars, considering that the TextIOWrapper should handle this input as a stream?
Is there a way to handle this with polars (I mean any way other than reading it with pandas then converting it with polars.from_pandas)?

Use pl.read_csv(wrapped, encoding='utf8-lossy').
help(pl.read_csv)
encoding : {'utf8', 'utf8-lossy', ...}
    Lossy means that invalid utf8 values are replaced with `�` characters.
    When using other encodings than `utf8` or `utf8-lossy`, the input is first
    decoded in memory with python. Defaults to `utf8`.
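To illustrate what "lossy" means here, the following standard-library sketch decodes the same bytes strictly and then with replacement. It does not call polars; it only shows the replacement behaviour the docstring describes, and the variable names are illustrative.

```python
from io import BytesIO, TextIOWrapper

csv_bytes = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)

# Strict UTF-8 decoding fails: "œ" in cp1252 is the single byte 0x9C,
# which is not a valid UTF-8 sequence on its own.
try:
    csv_bytes.decode("utf8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding replaces the invalid byte with U+FFFD (the "�" character).
lossy_text = csv_bytes.decode("utf8", errors="replace")

# A TextIOWrapper with errors="replace" yields the same text when read in Python.
wrapped_text = TextIOWrapper(
    BytesIO(csv_bytes), encoding="utf8", errors="replace", newline=""
).read()

print(strict_ok)
print(lossy_text)
```

The valid UTF-8 row (αυγό) survives intact, while only the cp1252 byte is replaced, which matches the output shown for both libraries.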
Improved script:
from io import TextIOWrapper, BytesIO

import polars as pl
import pandas as pd

csv_str = (
    b"spam,egg\n"
    + "spam,œuf\n".encode("cp1252")
    + "spam,αυγό\n".encode("utf8")
)
content = BytesIO(csv_str)
wrapped = TextIOWrapper(content, errors="replace")

try:
    dfpl = pl.read_csv(wrapped, encoding='utf8-lossy')
    print(dfpl)
except Exception as e:
    print("polars failed!")
    print(e)

wrapped.seek(0)

try:
    dfpd = pd.read_csv(wrapped, sep=",")
    print(dfpd)
except Exception as e:
    print("pandas failed!")
    print(e)
shape: (2, 2)
┌──────┬──────┐
│ spam ┆ egg  │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ spam ┆ �uf  │
│ spam ┆ αυγό │
└──────┴──────┘
   spam   egg
0  spam   �uf
1  spam  αυγό