I have a outer/inner loop-based function I'm trying to vectorise using Python Polars DataFrames. The function is a type of moving average and will be used to filter time-series financial data. Here's the function:
def ma_j(df_src: pl.DataFrame, depth: float):
jrc04 = 0.0
jrc05 = 0.0
jrc06 = 0.0
jrc08 = 0.0
series = df_src['close']
for x in range(0, len(series)):
if x >= x - depth*2:
for k in np.arange(start=math.ceil(depth), stop=0, step=-1):
jrc04 = jrc04 + abs(series[x-k] - series[x-(k+1)])
jrc05 = jrc05 + (depth + k) * abs(series[x-k] - series[x-(k+1)])
jrc06 = jrc06 + series[x-(k+1)]
else:
jrc03 = abs(series - (series[1]))
jrc13 = abs(series[x-depth] - series[x - (depth+1)])
jrc04 = jrc04 - jrc13 + jrc03
jrc05 = jrc05 - jrc04 + jrc03 * depth
jrc06 = jrc06 - series[x - (depth+1)] + series[x-1]
jrc08 = abs(depth * series[x] - jrc06)
if jrc05 == 0.0:
ma = 0.0
else:
ma = jrc08/jrc05
return ma
The tricky bit for me are multiple the inner loop look-backs (for k in...). I've looked through multiple examples that use group_by_dynamic
on the timeseries
data. For example, here. I've also seen an example for rolling
, but this still seems to use a period.
However, I'd like to strip away the timeseries
and just use source Series
. Does this mean I need to group on an integer range?
Using this data example:
import polars as pl
import numpy as np
i, t, v = np.arange(0, 50, 1), np.arange(0, 100, 2), np.random.randint(1,101,50)
df = pl.DataFrame({"i": i, "t": t, "rand": v})
df = df.with_columns((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")
cols = ["i", "datetime", "rand"]
df = df.select(cols)
DataFrame looks like this:
shape: (50, 3)
┌─────┬─────────────────────┬──────┐
│ i ┆ datetime ┆ rand │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i64 │
╞═════╪═════════════════════╪══════╡
│ 0 ┆ 2022-10-30 00:00:00 ┆ 87 │
│ 1 ┆ 2022-10-30 00:00:02 ┆ 66 │
│ 2 ┆ 2022-10-30 00:00:04 ┆ 30 │
│ 3 ┆ 2022-10-30 00:00:06 ┆ 87 │
│ 4 ┆ 2022-10-30 00:00:08 ┆ 74 │
│ … ┆ … ┆ … │
│ 45 ┆ 2022-10-30 00:01:30 ┆ 91 │
│ 46 ┆ 2022-10-30 00:01:32 ┆ 52 │
│ 47 ┆ 2022-10-30 00:01:34 ┆ 68 │
│ 48 ┆ 2022-10-30 00:01:36 ┆ 26 │
│ 49 ┆ 2022-10-30 00:01:38 ┆ 99 │
└─────┴─────────────────────┴──────┘
...I can do a grouping by datetime like this":
df.group_by_dynamic("datetime", every="10s").agg(
pl.col("rand").mean().alias('rolling mean')
)
But there's 3 issues with this:
i
?) in bins of [x] size.Any tips on how I could attack this using Polars? Thanks.
Following @ritchie46 's awesome advice (thanks mate!), here's the groupby:
result_grp = (
df
.rolling(index_column="i", period="10i")
.agg(
pl.len().alias("rolling_slots"),
pl.col("rand").mean().alias("roll_mean")
)
)
df2 = df.select(
pl.all(),
result_grp.get_column("rolling_slots"),
result_grp.get_column("roll_mean"),
)
This now gives:
shape: (50, 5)
┌─────┬─────────────────────┬──────┬───────────────┬───────────┐
│ i ┆ datetime ┆ rand ┆ rolling_slots ┆ roll_mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i64 ┆ u32 ┆ f64 │
╞═════╪═════════════════════╪══════╪═══════════════╪═══════════╡
│ 0 ┆ 2022-10-30 00:00:00 ┆ 23 ┆ 1 ┆ 23.0 │
│ 1 ┆ 2022-10-30 00:00:02 ┆ 72 ┆ 2 ┆ 47.5 │
│ 2 ┆ 2022-10-30 00:00:04 ┆ 46 ┆ 3 ┆ 47.0 │
│ 3 ┆ 2022-10-30 00:00:06 ┆ 37 ┆ 4 ┆ 44.5 │
│ 4 ┆ 2022-10-30 00:00:08 ┆ 12 ┆ 5 ┆ 38.0 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 45 ┆ 2022-10-30 00:01:30 ┆ 95 ┆ 10 ┆ 53.7 │
│ 46 ┆ 2022-10-30 00:01:32 ┆ 100 ┆ 10 ┆ 62.7 │
│ 47 ┆ 2022-10-30 00:01:34 ┆ 6 ┆ 10 ┆ 62.2 │
│ 48 ┆ 2022-10-30 00:01:36 ┆ 27 ┆ 10 ┆ 56.5 │
│ 49 ┆ 2022-10-30 00:01:38 ┆ 33 ┆ 10 ┆ 54.5 │
└─────┴─────────────────────┴──────┴───────────────┴───────────┘
This is great; now instead of mean(), how do I apply a custom function on the grouped values, such as:
f_jparams(depth_array, jrc04, jrc05, jrc06, jrc08):
_depth = len(depth_array)
if len(depth_array) > 3:
for x in np.arange(start=1, stop=len(depth_array), step=1):
jrc04 = jrc04 + abs(depth_array[x] - depth_array[x-1])
jrc05 = jrc05 + (_depth+x) * abs(depth_array[x] - depth_array[x-1])
jrc06 = jrc06 + depth_array[x-1]
else:
jrc03 = abs(depth_array[_depth-1] - depth_array[_depth-2])
jrc13 = abs(depth_array[0] - depth_array[1])
jrc04 = jrc04 - jrc13 + jrc03
jrc05 = jrc05 - jrc04 + jrc03*_depth
jrc06 = jrc06 - depth_array[1] + depth_array[_depth-2]
jrc08 = abs(_depth * depth_array[0] - jrc06)
if jrc05 == 0.0:
ma = 0.0
else:
ma = jrc08/jrc05
return ma, jrc04, jrc05, jrc06, jrc08
Thanks!
Thanks to this post, I can collect up the items in the rand
rolling group into a list for each row:
depth = 10
result_grp = (
df
.rolling(
index_column="i",
period=str(depth) + "i",
# offset="0i",
# closed="left"
)
.agg(
pl.len().alias("rolling_slots"),
pl.col("rand").mean().alias("roll_mean"),
pl.col("rand").name.suffix('_val_list'),
)
)
df2 = df.select(
pl.all(),
result_grp.get_column("rolling_slots"),
result_grp.get_column("roll_mean"),
result_grp.get_column("rand_val_list"),
)
Also from this post, I saw a way to make the rolling window period
a variable; nice!
Is there a way to use get_columns
and exclude
together so I don't have to list every col I want?
The dataframe now looks like:
shape: (50, 6)
┌─────┬─────────────────────┬──────┬───────────────┬───────────┬─────────────────┐
│ i ┆ datetime ┆ rand ┆ rolling_slots ┆ roll_mean ┆ rand_val_list │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i64 ┆ u32 ┆ f64 ┆ list[i64] │
╞═════╪═════════════════════╪══════╪═══════════════╪═══════════╪═════════════════╡
│ 0 ┆ 2022-10-30 00:00:00 ┆ 23 ┆ 1 ┆ 23.0 ┆ [23] │
│ 1 ┆ 2022-10-30 00:00:02 ┆ 72 ┆ 2 ┆ 47.5 ┆ [23, 72] │
│ 2 ┆ 2022-10-30 00:00:04 ┆ 46 ┆ 3 ┆ 47.0 ┆ [23, 72, 46] │
│ 3 ┆ 2022-10-30 00:00:06 ┆ 37 ┆ 4 ┆ 44.5 ┆ [23, 72, … 37] │
│ 4 ┆ 2022-10-30 00:00:08 ┆ 12 ┆ 5 ┆ 38.0 ┆ [23, 72, … 12] │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 45 ┆ 2022-10-30 00:01:30 ┆ 95 ┆ 10 ┆ 53.7 ┆ [10, 11, … 95] │
│ 46 ┆ 2022-10-30 00:01:32 ┆ 100 ┆ 10 ┆ 62.7 ┆ [11, 84, … 100] │
│ 47 ┆ 2022-10-30 00:01:34 ┆ 6 ┆ 10 ┆ 62.2 ┆ [84, 53, … 6] │
│ 48 ┆ 2022-10-30 00:01:36 ┆ 27 ┆ 10 ┆ 56.5 ┆ [53, 46, … 27] │
│ 49 ┆ 2022-10-30 00:01:38 ┆ 33 ┆ 10 ┆ 54.5 ┆ [46, 56, … 33] │
└─────┴─────────────────────┴──────┴───────────────┴───────────┴─────────────────┘
Should I just now resort back to looping through the rand_val_list
column and send each grouped values list to my function? Or is there a better polars
way?
Thanks again!
Are you searching for periods="10i"
?
Polars rolling
accepts a period
argument with the following query language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Where i
is simply the number of indices/rows.
So on your data a rolling group_by where we count the number of slots would give:
(df.rolling(index_column="i", period="10i")
.agg(
pl.len().alias("rolling_slots")
)
)
shape: (50, 2)
┌─────┬───────────────┐
│ i ┆ rolling_slots │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═════╪═══════════════╡
│ 0 ┆ 1 │
│ 1 ┆ 2 │
│ 2 ┆ 3 │
│ 3 ┆ 4 │
│ 4 ┆ 5 │
│ … ┆ … │
│ 45 ┆ 10 │
│ 46 ┆ 10 │
│ 47 ┆ 10 │
│ 48 ┆ 10 │
│ 49 ┆ 10 │
└─────┴───────────────┘