python-polars

How to form dynamic expressions without breaking on types


Is there any way to keep dynamic Polars expressions from breaking with errors?

Currently I'm just excluding the columns by type, but I'm wondering if there is a better way.

For example, I have a df coming from Parquet. If I just execute an expression on all columns, it might break for certain types. Instead, I want to contain these errors and possibly return a default value like None or -1.

import polars as pl

df = pl.scan_parquet("/path/to/data/*.parquet")

print(df.schema)
# Prints: {'date_time': <class 'polars.datatypes.Datetime'>, 'incident': <class 'polars.datatypes.Utf8'>, 'address': <class 'polars.datatypes.Utf8'>, 'city': <class 'polars.datatypes.Utf8'>, 'zipcode': <class 'polars.datatypes.Int32'>}

Now if I form a generic expression on top of this, there's a chance it may fail. For example:

# Finding positive count across all columns
# Fails due to: exceptions.ComputeError: cannot compare Utf8 with numeric data
print(df.select((pl.all() > 0).count().name.prefix("__positive_count_")).collect())


# Finding unique counts across all columns
# Fails due to: pyo3_runtime.PanicException: 'unique_counts' not implemented for datetime[ns] data types
print(df.select(pl.all().unique_counts().name.prefix("__unique_count_")).collect())


# Counting empty strings across all columns
# Fails due to: exceptions.SchemaError: Series dtype Int32 != utf8
# Note: this could have been avoided by doing an explicit cast to string first
print(df.select((pl.all().str.len_chars() > 0).count().name.prefix("__empty_count_")).collect())

Solution

  • I'll stick to things that work in lazy mode, since it appears you are working in lazy mode with Parquet files.

    Let's use this data as an example:

    import polars as pl
    from datetime import datetime
    
    df = pl.DataFrame(
        {
            "col_int": [-2, -2, 0, 2, 2],
            "col_float": [-20.0, -10, 10, 20, 20],
            "col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo", eager=True),
            "col_str": ["str1", "str2", "", None, "str5"],
            "col_bool": [True, False, False, True, False],
        }
    ).lazy()
    df.collect()
    
    shape: (5, 5)
    ┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
    │ col_int ┆ col_float ┆ col_date            ┆ col_str ┆ col_bool │
    │ ---     ┆ ---       ┆ ---                 ┆ ---     ┆ ---      │
    │ i64     ┆ f64       ┆ datetime[ns]        ┆ str     ┆ bool     │
    ╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
    │ -2      ┆ -20.0     ┆ 2020-01-01 00:00:00 ┆ str1    ┆ true     │
    │ -2      ┆ -10.0     ┆ 2020-02-01 00:00:00 ┆ str2    ┆ false    │
    │ 0       ┆ 10.0      ┆ 2020-03-01 00:00:00 ┆         ┆ false    │
    │ 2       ┆ 20.0      ┆ 2020-04-01 00:00:00 ┆ null    ┆ true     │
    │ 2       ┆ 20.0      ┆ 2020-05-01 00:00:00 ┆ str5    ┆ false    │
    └─────────┴───────────┴─────────────────────┴─────────┴──────────┘
    

    Using the col Expression

    One feature of the col expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to restrict our queries to floats, we can do the following:

    df.select((pl.col(pl.Float64) > 0).sum().name.suffix("__positive_count_")).collect()
    
    shape: (1, 1)
    ┌────────────────────────────┐
    │ col_float__positive_count_ │
    │ ---                        │
    │ u32                        │
    ╞════════════════════════════╡
    │ 3                          │
    └────────────────────────────┘
    

    (Note: (pl.col(...) > 0) creates a Series of boolean values that need to be summed, not counted; count would tally every non-null value, whether True or False.)
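
    For a quick look at the difference on a small throwaway frame (illustration only, not part of the data above):

    tmp = pl.DataFrame({"x": [-1, 0, 1, 2]})

    # sum() adds up only the True values
    tmp.select((pl.col("x") > 0).sum())    # returns 2

    # count() tallies every non-null value, True and False alike
    tmp.select((pl.col("x") > 0).count())  # returns 4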

    To include more than one datatype, you can supply a list of datatypes to col.

    df.select(
        (pl.col([pl.Int64, pl.Float64]) > 0).sum().name.suffix("__positive_count_")
    ).collect()
    
    shape: (1, 2)
    ┌──────────────────────────┬────────────────────────────┐
    │ col_int__positive_count_ ┆ col_float__positive_count_ │
    │ ---                      ┆ ---                        │
    │ u32                      ┆ u32                        │
    ╞══════════════════════════╪════════════════════════════╡
    │ 2                        ┆ 3                          │
    └──────────────────────────┴────────────────────────────┘
    

    You can also combine these into the same select statement if you'd like.

    df.select(
        (pl.col(pl.String).str.len_chars() == 0).sum().name.suffix("__empty_count"),
        pl.col(pl.String).is_null().sum().name.suffix("__null_count"),
        (pl.col([pl.Float64, pl.Int64]) > 0).sum().name.suffix("_positive_count"),
    ).collect()
    
    shape: (1, 4)
    ┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
    │ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
    │ ---                  ┆ ---                 ┆ ---                      ┆ ---                    │
    │ u32                  ┆ u32                 ┆ u32                      ┆ u32                    │
    ╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
    │ 1                    ┆ 1                   ┆ 3                        ┆ 2                      │
    └──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
    

    The Polars documentation has a handy list of datatypes.
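
    If you'd rather match whole families of types (all integers, all floats) without listing each one, newer Polars versions also ship a selectors module. A minimal sketch, assuming a version where polars.selectors is available:

    import polars.selectors as cs

    # cs.numeric() expands to every integer and float column,
    # so the dtype list no longer needs to be maintained by hand
    df.select((cs.numeric() > 0).sum().name.suffix("__positive_count_")).collect()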

    Using the exclude expression

    Another handy trick is to use the exclude expression. With this, we can select all columns except columns of certain datatypes. For example:

    df.select(
        pl.exclude(pl.String).max().name.suffix("_max"),
        pl.exclude(pl.String, pl.Boolean).min().name.suffix("_min"),
    ).collect()
    
    shape: (1, 7)
    ┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
    │ col_int_max ┆ col_float_max ┆ col_date_max        ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min        │
    │ ---         ┆ ---           ┆ ---                 ┆ ---          ┆ ---         ┆ ---           ┆ ---                 │
    │ i64         ┆ f64           ┆ datetime[ns]        ┆ u32          ┆ i64         ┆ f64           ┆ datetime[ns]        │
    ╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
    │ 2           ┆ 20.0          ┆ 2020-05-01 00:00:00 ┆ 1            ┆ -2          ┆ -20.0         ┆ 2020-01-01 00:00:00 │
    └─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
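
    Since the schema of a LazyFrame is known before you collect, another option is to build the expression list yourself from the schema and fall back to a default value (None, -1, and so on) for any dtype the operation can't handle, which is closer to what the question asked for. A rough sketch, assuming a Polars version where DataType.is_numeric() exists (positive_count_or_default is my own name, not a Polars API):

    def positive_count_or_default(lf: pl.LazyFrame) -> pl.LazyFrame:
        exprs = []
        for name, dtype in lf.schema.items():
            if dtype.is_numeric():
                # supported dtype: run the real computation
                exprs.append((pl.col(name) > 0).sum().alias(f"__positive_count_{name}"))
            else:
                # unsupported dtype: contain the error with a default value
                exprs.append(pl.lit(None).alias(f"__positive_count_{name}"))
        return lf.select(exprs)

    positive_count_or_default(df).collect()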
    

    Unique counts

    One caution: unique_counts results in Series of varying lengths.

    df.select(pl.col("col_int").unique_counts().name.prefix(
        "__unique_count_")).collect()
    
    shape: (3, 1)
    ┌────────────────────────┐
    │ __unique_count_col_int │
    │ ---                    │
    │ u32                    │
    ╞════════════════════════╡
    │ 2                      │
    │ 1                      │
    │ 2                      │
    └────────────────────────┘
    
    df.select(pl.col("col_float").unique_counts().name.prefix(
        "__unique_count_")).collect()
    
    shape: (4, 1)
    ┌──────────────────────────┐
    │ __unique_count_col_float │
    │ ---                      │
    │ u32                      │
    ╞══════════════════════════╡
    │ 1                        │
    │ 1                        │
    │ 1                        │
    │ 2                        │
    └──────────────────────────┘
    

    As such, these should not be combined into the same result. Each column/Series of a DataFrame must have the same length.
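
    If you do want the counts side by side anyway, one workaround (assuming your Polars version has Expr.implode) is to wrap each result in a single list value, so every output column has length 1:

    df.select(
        pl.col(["col_int", "col_float"])
        .unique_counts()
        .implode()
        .name.prefix("__unique_count_")
    ).collect()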