Update: This issue is no longer present in Polars. Datetime means are calculated as expected.
I have a dataframe with a column of datetimes, a column of floats, and a column of integers like this:
┌─────────────────────────┬───────────┬─────────────┐
│ time                    ┆ NAV_DEPTH ┆ coarse_ints │
│ ---                     ┆ ---       ┆ ---         │
│ datetime[ms]            ┆ f64       ┆ i64         │
╞═════════════════════════╪═══════════╪═════════════╡
│ 2019-07-21 23:25:02.737 ┆ 3.424     ┆ 0           │
│ 2019-07-21 23:25:32.745 ┆ 2.514     ┆ 0           │
│ 2019-07-21 23:26:02.753 ┆ 2.514     ┆ 0           │
│ 2019-07-21 23:26:32.668 ┆ 2.323     ┆ 0           │
│ 2019-07-23 21:24:16.383 ┆ 3.17      ┆ 689         │
│ 2019-07-23 21:24:46.390 ┆ 3.213     ┆ 689         │
│ 2019-07-23 21:25:16.396 ┆ 3.361     ┆ 689         │
│ 2019-07-23 21:25:46.402 ┆ 3.403     ┆ 690         │
└─────────────────────────┴───────────┴─────────────┘
The integer column serves to split the dataset into sequential groups of 8 samples for averaging. I would like to perform a group_by on the integer column and get the mean depth and mean datetime for each integer. It works with median:
df.group_by('coarse_ints').median()
┌─────────────┬─────────────────────────┬───────────┐
│ coarse_ints ┆ time                    ┆ NAV_DEPTH │
│ ---         ┆ ---                     ┆ ---       │
│ i64         ┆ datetime[ms]            ┆ f64       │
╞═════════════╪═════════════════════════╪═══════════╡
│ 689         ┆ 2019-07-23 21:24:46.390 ┆ 3.213     │
│ 690         ┆ 2019-07-23 21:25:46.402 ┆ 3.403     │
│ 0           ┆ 2019-07-21 23:25:47.749 ┆ 2.514     │
└─────────────┴─────────────────────────┴───────────┘
But with mean, the datetimes all go to null:
df.group_by('coarse_ints').mean()
┌─────────────┬──────────────┬───────────┐
│ coarse_ints ┆ time         ┆ NAV_DEPTH │
│ ---         ┆ ---          ┆ ---       │
│ i64         ┆ datetime[ms] ┆ f64       │
╞═════════════╪══════════════╪═══════════╡
│ 0           ┆ null         ┆ 2.69375   │
│ 690         ┆ null         ┆ 3.403     │
│ 689         ┆ null         ┆ 3.248     │
└─────────────┴──────────────┴───────────┘
group_by_dynamic looked promising, but it needs a regular time interval. I need to average every 8 samples, regardless of the irregular time intervals.
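For reference, a column like coarse_ints can also be derived rather than stored. A minimal sketch (not from the original question), assuming the groups simply run sequentially over the whole frame:

import polars as pl

# Sketch: build 'coarse_ints' by integer-dividing a 0-based row index
# by the group size of 8 samples.
df = df.with_columns(
    (pl.int_range(pl.len()) // 8).alias('coarse_ints')
)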
If you operate on the underlying integer representation of the datetime and then cast back when done, you can get the mean via a regular group_by (I admit this is slightly non-intuitive ;)
df.with_columns(
    pl.col('time').to_physical()  # datetime[ms] -> i64 (milliseconds since epoch)
).group_by(
    pl.col('coarse_ints'),
    maintain_order=True  # or not :)
).mean().with_columns(
    pl.col('time').cast(pl.Datetime('ms'))  # i64 -> datetime[ms] again
)
Note that casting back from the physical/integer representation should use the original time unit (e.g. 'ms', 'us', 'ns') so as to avoid potentially incorrect scaling.
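To avoid hard-coding the unit, one option — a minimal sketch, assuming the frame is named df — is to capture the original dtype from the schema and cast back to exactly that:

import polars as pl

# Sketch: remember the original dtype (e.g. Datetime('ms')) up front,
# then cast the physical mean back to it, so the time unit always matches.
time_dtype = df.schema['time']

out = (
    df.with_columns(pl.col('time').to_physical())
    .group_by('coarse_ints', maintain_order=True)
    .mean()
    .with_columns(pl.col('time').cast(time_dtype))
)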
Update: for anyone coming to this answer now, we have supported the full set of suitable aggregate functions on temporal columns for a while, so the new answer is just "use mean() as normal":
df.group_by(
    pl.col('coarse_ints'),
    maintain_order=True  # or not :)
).mean()
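As a quick end-to-end check — a minimal, self-contained sketch rebuilding a few rows from the question's frame:

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        'time': [
            datetime(2019, 7, 21, 23, 25, 2, 737000),
            datetime(2019, 7, 21, 23, 25, 32, 745000),
            datetime(2019, 7, 23, 21, 24, 16, 383000),
            datetime(2019, 7, 23, 21, 24, 46, 390000),
        ],
        'NAV_DEPTH': [3.424, 2.514, 3.17, 3.213],
        'coarse_ints': [0, 0, 689, 689],
    }
).with_columns(pl.col('time').cast(pl.Datetime('ms')))

# On recent Polars this prints mean datetimes per group rather than null.
print(df.group_by('coarse_ints', maintain_order=True).mean())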