I have a TSV file that contains integers with thousands separators. I'm trying to read it using polars==1.6.0; the encoding is UTF-16.
from io import BytesIO
import polars as pl
data = BytesIO(
"""
Id\tA\tB
1\t537\t2,288
2\t325\t1,047
3\t98\t194
""".encode("utf-16")
)
df = pl.read_csv(data, encoding="utf-16", separator="\t")
print(df)
I cannot figure out how to get polars to treat column "B" as integer rather than string, and I also cannot find a clean way of casting it to an integer.
shape: (3, 3)
┌────────┬─────┬───────┐
│ Id     ┆ A   ┆ B     │
│ ---    ┆ --- ┆ ---   │
│ i64    ┆ i64 ┆ str   │
╞════════╪═════╪═══════╡
│ 1      ┆ 537 ┆ 2,288 │
│ 2      ┆ 325 ┆ 1,047 │
│ 3      ┆ 98  ┆ 194   │
└────────┴─────┴───────┘
A plain cast fails, as does passing the schema explicitly. I also tried using str.strip_chars to remove the comma, but that leaves the comma in place; my workaround is to use str.replace_all instead.
df = df.with_columns(
pl.col("B").str.strip_chars(",").alias("B_strip_chars"),
pl.col("B").str.replace_all("[^0-9]", "").alias("B_replace"),
)
print(df)
shape: (3, 5)
┌────────┬─────┬───────┬───────────────┬───────────┐
│ Id     ┆ A   ┆ B     ┆ B_strip_chars ┆ B_replace │
│ ---    ┆ --- ┆ ---   ┆ ---           ┆ ---       │
│ i64    ┆ i64 ┆ str   ┆ str           ┆ str       │
╞════════╪═════╪═══════╪═══════════════╪═══════════╡
│ 1      ┆ 537 ┆ 2,288 ┆ 2,288         ┆ 2288      │
│ 2      ┆ 325 ┆ 1,047 ┆ 1,047         ┆ 1047      │
│ 3      ┆ 98  ┆ 194   ┆ 194           ┆ 194       │
└────────┴─────┴───────┴───────────────┴───────────┘
Also, for this to work in general I'd need to ensure that read_csv doesn't try to infer types for any columns, so that I can convert them all manually (any numeric column with a value > 999 will contain a comma).
To allow for multiple , separators within a single value, use .str.replace_all:
df = df.with_columns(pl.col('B').str.replace_all(",", "").cast(pl.Int64))
which gives for the sample data:
shape: (3, 3)
┌─────┬─────┬──────┐
│ Id  ┆ A   ┆ B    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ 537 ┆ 2288 │
│ 2   ┆ 325 ┆ 1047 │
│ 3   ┆ 98  ┆ 194  │
└─────┴─────┴──────┘