How to use the interval values in a categorical column returned by `.hist()` in Polars?


Let's say I have this:

>>> import numpy
>>> import polars as pl
>>> df = pl.DataFrame(dict(j=numpy.random.randint(10, 99, 20)))
>>> df
shape: (20, 1)
┌─────┐
│ j   │
│ --- │
│ i64 │
╞═════╡
│ 47  │
│ 22  │
│ 82  │
│ 19  │
│ …   │
│ 28  │
│ 94  │
│ 21  │
│ 38  │
└─────┘
>>> df.get_column('j').hist([10, 20, 30, 50])
shape: (5, 3)
┌─────────────┬──────────────┬─────────┐
│ break_point ┆ category     ┆ j_count │
│ ---         ┆ ---          ┆ ---     │
│ f64         ┆ cat          ┆ u32     │
╞═════════════╪══════════════╪═════════╡
│ 10.0        ┆ (-inf, 10.0] ┆ 0       │
│ 20.0        ┆ (10.0, 20.0] ┆ 4       │
│ 30.0        ┆ (20.0, 30.0] ┆ 5       │
│ 50.0        ┆ (30.0, 50.0] ┆ 3       │
│ inf         ┆ (50.0, inf]  ┆ 8       │
└─────────────┴──────────────┴─────────┘

How can I work with the values in the category column? For example, how would I filter for rows where the category includes -inf, or where the upper bound is between 10.0 and 30.0?


Solution

  • Update: There is a feature request to change .hist() and make it return a struct of values instead.


    There may be a better way, but since the category strings follow a regular format, you could parse them with expressions:

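    hist = df.get_column('j').hist([10, 20, 30, 50])

    hist.with_columns(
        pl.col('category').cast(pl.String)         # e.g. "(10.0, 20.0]"
          .str.strip_chars('(]')                   # drop the brackets: "10.0, 20.0"
          .str.splitn(', ', 2)                     # struct with two string fields
          .struct.rename_fields(['lower', 'upper'])
          .struct.field('*')                       # expand both fields into columns
          .cast(pl.Float64)                        # "-inf" / "inf" parse as floats
    )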
    
    shape: (5, 5)
    ┌────────────┬──────────────┬───────┬───────┬───────┐
    │ breakpoint ┆ category     ┆ count ┆ lower ┆ upper │
    │ ---        ┆ ---          ┆ ---   ┆ ---   ┆ ---   │
    │ f64        ┆ cat          ┆ u32   ┆ f64   ┆ f64   │
    ╞════════════╪══════════════╪═══════╪═══════╪═══════╡
    │ 10.0       ┆ (-inf, 10.0] ┆ 0     ┆ -inf  ┆ 10.0  │
    │ 20.0       ┆ (10.0, 20.0] ┆ 2     ┆ 10.0  ┆ 20.0  │
    │ 30.0       ┆ (20.0, 30.0] ┆ 4     ┆ 20.0  ┆ 30.0  │
    │ 50.0       ┆ (30.0, 50.0] ┆ 5     ┆ 30.0  ┆ 50.0  │
    │ inf        ┆ (50.0, inf]  ┆ 9     ┆ 50.0  ┆ inf   │
    └────────────┴──────────────┴───────┴───────┴───────┘