Let's say I have this:
>>> import numpy
>>> import polars as pl
>>> df = pl.DataFrame(dict(j=numpy.random.randint(10, 99, 20)))
>>> df
shape: (20, 1)
┌─────┐
│ j   │
│ --- │
│ i64 │
╞═════╡
│ 47  │
│ 22  │
│ 82  │
│ 19  │
│ …   │
│ 28  │
│ 94  │
│ 21  │
│ 38  │
└─────┘
>>> df.get_column('j').hist([10, 20, 30, 50])
shape: (5, 3)
┌─────────────┬──────────────┬─────────┐
│ break_point ┆ category     ┆ j_count │
│ ---         ┆ ---          ┆ ---     │
│ f64         ┆ cat          ┆ u32     │
╞═════════════╪══════════════╪═════════╡
│ 10.0        ┆ (-inf, 10.0] ┆ 0       │
│ 20.0        ┆ (10.0, 20.0] ┆ 4       │
│ 30.0        ┆ (20.0, 30.0] ┆ 5       │
│ 50.0        ┆ (30.0, 50.0] ┆ 3       │
│ inf         ┆ (50.0, inf]  ┆ 8       │
└─────────────┴──────────────┴─────────┘
How would I go about working with the category column? For example, how would I filter for rows where category contains -inf, or where the upper bound is between 10.0 and 30.0, or something along those lines?
Update: There is a feature request to change .hist() and make it return a struct of values instead.
Perhaps there is a better way, but since the category values follow a regular format, you could parse them using expressions:
hist = df.get_column('j').hist([10, 20, 30, 50])
hist.with_columns(
    pl.col('category').cast(pl.String)
      .str.strip_chars('(]')
      .str.splitn(', ', 2)
      .struct.rename_fields(['lower', 'upper'])
      .struct.field('*')
      .cast(pl.Float64)
)
shape: (5, 5)
┌────────────┬──────────────┬───────┬───────┬───────┐
│ breakpoint ┆ category     ┆ count ┆ lower ┆ upper │
│ ---        ┆ ---          ┆ ---   ┆ ---   ┆ ---   │
│ f64        ┆ cat          ┆ u32   ┆ f64   ┆ f64   │
╞════════════╪══════════════╪═══════╪═══════╪═══════╡
│ 10.0       ┆ (-inf, 10.0] ┆ 0     ┆ -inf  ┆ 10.0  │
│ 20.0       ┆ (10.0, 20.0] ┆ 2     ┆ 10.0  ┆ 20.0  │
│ 30.0       ┆ (20.0, 30.0] ┆ 4     ┆ 20.0  ┆ 30.0  │
│ 50.0       ┆ (30.0, 50.0] ┆ 5     ┆ 30.0  ┆ 50.0  │
│ inf        ┆ (50.0, inf]  ┆ 9     ┆ 50.0  ┆ inf   │
└────────────┴──────────────┴───────┴───────┴───────┘