
python-polars casting string to numeric


When applying pandas.to_numeric, the dtype pandas returns is float64 or int64, depending on the data supplied: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

Is there an equivalent way to do this in Polars?

I have seen "How to cast a column with data type List[null] to List[i64] in polars", but I don't want to cast each column individually. I have a couple of string columns that I want to turn numeric; they could hold int or float values.

# Code showing the casting done by pandas.to_numeric
import pandas as pd

df = pd.DataFrame({"col1": ["1", "2"], "col2": ["3.5", "4.6"]})
print("DataFrame:")
print(df)
# Convert both string columns; each becomes int64 or float64 depending on its values
df[["col1", "col2"]] = df[["col1", "col2"]].apply(pd.to_numeric)
print(df.dtypes)

Solution

  • Unlike pandas, Polars is strict about data types and does very little automatic casting, partly for performance reasons.

    You can create a feature request for a to_numeric method (but I'm not sure how enthusiastic the response will be.)

    That said, here are some easy ways to accomplish this.

    Create a method

    Perhaps the simplest way is to write a function that attempts the cast to integer and, if that raises an exception, falls back to a cast to float. For convenience, you can even attach this function to the Series class itself as a method.

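    import polars as pl


    def to_numeric(s: pl.Series) -> pl.Series:
        # Try a strict integer cast first; it raises if any value is not an integer.
        try:
            result = s.cast(pl.Int64)
        except pl.exceptions.InvalidOperationError:
            # Fall back to float for values such as "3.5".
            result = s.cast(pl.Float64)
        return result


    # Attach the function to Series so it can be called as s.to_numeric().
    pl.Series.to_numeric = to_numeric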
    

    Then to use it (here df is a Polars DataFrame holding the string columns, not the pandas one from the question):

    (
        pl.select(
            s.to_numeric()
            for s in df
        )
    )
    
    shape: (2, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ i64  ┆ f64  │
    ╞══════╪══════╡
    │ 1    ┆ 3.5  │
    │ 2    ┆ 4.6  │
    └──────┴──────┘
    

    Use the automatic casting of CSV parsing

    Another method is to write your columns to CSV (in a string buffer) and then have read_csv infer the types automatically. You may have to tweak the infer_schema_length parameter in some situations, for example when a float value first appears far down a column.

    from io import StringIO
    pl.read_csv(StringIO(df.write_csv()))
    
    shape: (2, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ i64  ┆ f64  │
    ╞══════╪══════╡
    │ 1    ┆ 3.5  │
    │ 2    ┆ 4.6  │
    └──────┴──────┘