Say I have this:
numpy.random.seed(1)
df = (polars
.DataFrame(dict(
dt=numpy.random.randint(datetime.datetime(2023, 1, 1).timestamp(), datetime.datetime(2023, 1, 31).timestamp(), 100)
))
.select(
polars.from_epoch('dt').sort()
)
.filter(
(polars.col('dt').dt.day().cos() * 17).floor() % 3 == 0
)
.pipe(lambda df: df.with_columns(
j=polars.lit(numpy.random.randint(10, 99, df.height)),
))
.with_columns(
k=polars.col('j').last().over(polars.col('dt').dt.date()),
)
)
which produces:
dt (datetime[μs]) j (i64) k (i64)
2023-01-01 12:48:34 30 30
2023-01-04 09:37:05 42 75
2023-01-04 15:13:42 22 75
2023-01-04 22:58:20 75 75
2023-01-07 00:18:27 70 20
2023-01-07 02:42:28 34 20
2023-01-07 06:32:09 92 20
2023-01-07 09:38:43 12 20
2023-01-07 20:59:16 20 20
2023-01-08 05:25:04 64 76
2023-01-08 09:10:17 92 76
2023-01-08 10:53:40 96 76
2023-01-08 14:29:28 80 76
2023-01-08 16:11:37 76 76
2023-01-10 12:59:38 81 58
2023-01-10 21:21:29 58 58
2023-01-11 14:33:55 64 52
2023-01-11 18:54:01 25 52
2023-01-11 22:00:55 15 52
2023-01-11 22:34:28 27 52
2023-01-11 23:41:27 52 52
2023-01-13 04:07:50 30 23
2023-01-13 08:20:19 58 23
2023-01-13 09:44:07 32 23
2023-01-13 14:18:54 23 23
2023-01-20 08:42:19 63 94
2023-01-20 17:07:37 94 94
shape: (27, 3)
where I've calculated k
to be the last value of j
for each of the dates. I'd like to add another column l
which would have the value of k
for the previous available date, i.e.:
dt (date) j (i64) k (i64) l (i64)
2023-01-01 30 30 null
2023-01-04 42 75 30
2023-01-04 22 75 30
2023-01-04 75 75 30
2023-01-07 70 20 75
2023-01-07 34 20 75
2023-01-07 92 20 75
2023-01-07 12 20 75
2023-01-07 20 20 75
2023-01-08 64 76 20
2023-01-08 92 76 20
2023-01-08 96 76 20
2023-01-08 80 76 20
2023-01-08 76 76 20
2023-01-10 81 58 76
2023-01-10 58 58 76
2023-01-11 64 52 58
2023-01-11 25 52 58
2023-01-11 15 52 58
2023-01-11 27 52 58
2023-01-11 52 52 58
2023-01-13 30 23 52
2023-01-13 58 23 52
2023-01-13 32 23 52
2023-01-13 23 23 52
2023-01-20 63 94 23
2023-01-20 94 94 23
shape: (27, 4)
That is, instead of "shift by a constant number of rows" which the standard shift function provides, I'm looking for a "shift by date" functionality. How would I go about doing that?
Can you .last().over(day).shift()
+ .first().over(day)
?
df.with_columns(
l = pl.col('k').last().over(pl.col('dt').dt.truncate('1d')).shift()
).with_columns(
pl.first('l').over(pl.col('dt').dt.truncate('1d'))
)
shape: (27, 4)
┌─────────────────────┬─────┬─────┬──────┐
│ dt ┆ j ┆ k ┆ l │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════════╪═════╪═════╪══════╡
│ 2023-01-01 12:48:34 ┆ 30 ┆ 30 ┆ null │
│ 2023-01-04 09:37:05 ┆ 42 ┆ 75 ┆ 30 │
│ 2023-01-04 15:13:42 ┆ 22 ┆ 75 ┆ 30 │
│ 2023-01-04 22:58:20 ┆ 75 ┆ 75 ┆ 30 │
│ 2023-01-07 00:18:27 ┆ 70 ┆ 20 ┆ 75 │
│ 2023-01-07 02:42:28 ┆ 34 ┆ 20 ┆ 75 │
│ 2023-01-07 06:32:09 ┆ 92 ┆ 20 ┆ 75 │
│ 2023-01-07 09:38:43 ┆ 12 ┆ 20 ┆ 75 │
│ 2023-01-07 20:59:16 ┆ 20 ┆ 20 ┆ 75 │
│ 2023-01-08 05:25:04 ┆ 64 ┆ 76 ┆ 20 │
│ 2023-01-08 09:10:17 ┆ 92 ┆ 76 ┆ 20 │
│ 2023-01-08 10:53:40 ┆ 96 ┆ 76 ┆ 20 │
│ 2023-01-08 14:29:28 ┆ 80 ┆ 76 ┆ 20 │
│ 2023-01-08 16:11:37 ┆ 76 ┆ 76 ┆ 20 │
│ 2023-01-10 12:59:38 ┆ 81 ┆ 58 ┆ 76 │
│ 2023-01-10 21:21:29 ┆ 58 ┆ 58 ┆ 76 │
│ 2023-01-11 14:33:55 ┆ 64 ┆ 52 ┆ 58 │
│ 2023-01-11 18:54:01 ┆ 25 ┆ 52 ┆ 58 │
│ 2023-01-11 22:00:55 ┆ 15 ┆ 52 ┆ 58 │
│ 2023-01-11 22:34:28 ┆ 27 ┆ 52 ┆ 58 │
│ 2023-01-11 23:41:27 ┆ 52 ┆ 52 ┆ 58 │
│ 2023-01-13 04:07:50 ┆ 30 ┆ 23 ┆ 52 │
│ 2023-01-13 08:20:19 ┆ 58 ┆ 23 ┆ 52 │
│ 2023-01-13 09:44:07 ┆ 32 ┆ 23 ┆ 52 │
│ 2023-01-13 14:18:54 ┆ 23 ┆ 23 ┆ 52 │
│ 2023-01-20 08:42:19 ┆ 63 ┆ 94 ┆ 23 │
│ 2023-01-20 17:07:37 ┆ 94 ┆ 94 ┆ 23 │
└─────────────────────┴─────┴─────┴──────┘