I have some polars code that functionally can do what I want, but I feel it is an inefficient implementation at best. I feel that there must be some way to achieve the same result with .map_batches()
, but I can't figure out how. Any thoughts or suggestions?
Specifically, my data are organized as follows: Each column is a location, and each row is a datetime. What I'm trying to do is calculate the maximum count of consecutive non-zero values (which I converted to Booleans because I don't need the magnitude of the value, I just need to know if the value is zero or not). Example data and example expected output below:
Example Dummy Data
pivoted_df = pl.from_repr("""
┌─────────────────────┬────────────┬────────────┐
│ Date ┆ Location 1 ┆ Location 2 │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ i64 ┆ i64 │
╞═════════════════════╪════════════╪════════════╡
│ 2023-01-01 00:00:00 ┆ 0 ┆ 1 │
│ 2023-01-01 01:00:00 ┆ 1 ┆ 1 │
│ 2023-01-01 02:00:00 ┆ 1 ┆ 1 │
│ 2023-01-01 03:00:00 ┆ 0 ┆ 1 │
│ 2023-01-01 04:00:00 ┆ 1 ┆ 1 │
│ 2023-01-01 05:00:00 ┆ 1 ┆ 0 │
│ 2023-01-01 06:00:00 ┆ 1 ┆ 0 │
└─────────────────────┴────────────┴────────────┘
""")
Expected Output:
┌────────────┬───────┐
│ Location ┆ Value │
│ --- ┆ --- │
│ str ┆ i32 │
╞════════════╪═══════╡
│ Location 1 ┆ 3 │
│ Location 2 ┆ 5 │
└────────────┴───────┘
Below is the code I have that is functional, but feels like it can be improved my someone smarter and more well-versed in polars than I am.
for col in pivoted_df.drop("Date").columns:
xy_cont_df_a = (
pivoted_df.select(pl.col(col))
.with_columns(
pl.when(
pl.col(col).cast(pl.Boolean)
& pl.col(col)
.cast(pl.Boolean)
.shift(-1, fill_value=False)
.not_()
).then(
pl.count().over(
(
pl.col(col).cast(pl.Boolean)
!= pl.col(col).cast(pl.Boolean).shift()
).cum_sum()
)
)
)
.max()
)
Update: Polars 0.18.7
added .rle()
and .rle_id()
which simplify these types of tasks.
df.select(
pl.col('Location 1', 'Location 2').rle()
.struct['len'].max().name.keep()
)
shape: (1, 2)
┌────────────┬────────────┐
│ Location 1 ┆ Location 2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════════╪════════════╡
│ 3 ┆ 5 │
└────────────┴────────────┘
With unpivoted data:
df.group_by('Location').agg(
pl.col('Value').rle().struct['len'].max()
)
shape: (2, 2)
┌────────────┬───────┐
│ Location ┆ Value │
│ --- ┆ --- │
│ str ┆ i32 │
╞════════════╪═══════╡
│ Location 2 ┆ 5 │
│ Location 1 ┆ 3 │
└────────────┴───────┘