
Parsing numeric data with thousands separator in `polars`


I have a TSV file that contains integers with thousands separators. I'm trying to read it with polars==1.6.0; the file is encoded as UTF-16.

from io import BytesIO
import polars as pl

data = BytesIO(
"""
Id\tA\tB
1\t537\t2,288
2\t325\t1,047
3\t98\t194
""".encode("utf-16")
)

df = pl.read_csv(data, encoding="utf-16", separator="\t")
print(df)

I cannot figure out how to get polars to treat column "B" as an integer rather than a string, and I also cannot find a clean way of casting it to an integer.

shape: (3, 3)
┌────────┬─────┬───────┐
│ Id     ┆ A   ┆ B     │
│ ---    ┆ --- ┆ ---   │
│ i64    ┆ i64 ┆ str   │
╞════════╪═════╪═══════╡
│ 1      ┆ 537 ┆ 2,288 │
│ 2      ┆ 325 ┆ 1,047 │
│ 3      ┆ 98  ┆ 194   │
└────────┴─────┴───────┘

A plain cast fails, as does passing the schema explicitly. I also tried str.strip_chars to remove the comma, but it only strips leading and trailing characters, so the value is unchanged; my workaround is to use str.replace_all instead.

df = df.with_columns(
    pl.col("B").str.strip_chars(",").alias("B_strip_chars"),
    pl.col("B").str.replace_all("[^0-9]", "").alias("B_replace"),
)
print(df)

shape: (3, 5)
┌────────┬─────┬───────┬───────────────┬───────────┐
│ Id     ┆ A   ┆ B     ┆ B_strip_chars ┆ B_replace │
│ ---    ┆ --- ┆ ---   ┆ ---           ┆ ---       │
│ i64    ┆ i64 ┆ str   ┆ str           ┆ str       │
╞════════╪═════╪═══════╪═══════════════╪═══════════╡
│ 1      ┆ 537 ┆ 2,288 ┆ 2,288         ┆ 2288      │
│ 2      ┆ 325 ┆ 1,047 ┆ 1,047         ┆ 1047      │
│ 3      ┆ 98  ┆ 194   ┆ 194           ┆ 194       │
└────────┴─────┴───────┴───────────────┴───────────┘

Also, for this to work in general, I'd need to ensure that read_csv doesn't try to infer types for any column, so that I can convert them all manually (any numeric column with a value > 999 will contain a comma).


Solution

  • To handle values that may contain more than one , separator, use .str.replace_all and then cast:

    df = df.with_columns(pl.col('B').str.replace_all(",", "").cast(pl.Int64))
    

    which gives for the sample data:

    shape: (3, 3)
    ┌─────┬─────┬──────┐
    │ Id  ┆ A   ┆ B    │
    │ --- ┆ --- ┆ ---  │
    │ i64 ┆ i64 ┆ i64  │
    ╞═════╪═════╪══════╡
    │ 1   ┆ 537 ┆ 2288 │
    │ 2   ┆ 325 ┆ 1047 │
    │ 3   ┆ 98  ┆ 194  │
    └─────┴─────┴──────┘