There is a dataset with columns ID
and Feature_1
. Feature_1
could be understood as a specific duration of session in secs. There is also a custom function, which calculates moving average with addition of simple average in the begining according to number of NaN's caused by window width. Here it is:
def moving_average_mit_anfang(x, w):
# First part - simple average
first_part_result = np.cumsum(x)/np.cumsum(np.ones(len(x)))
# If appearence of user's sessions is greater than window width, we calculate moving average
if len(x)>w:
# Second part - moving average with window w
sec_part_result = np.convolve(x, np.ones(w), 'valid') / w
return np.append(first_part_result[:-len(sec_part_result)],sec_part_result)
# Otherwise we calculate only simple average
else:
return first_part_result
We should apply this function on column Featrue_1
in such way, that we get current average value for each ID, according to appearence time of corresponding ID.
Example dataframe:
pd.DataFrame(data={'ID':[1,2,3,2,3,1,2,1,3,3,3,2,1],
'Feature_1':[4,5,6,73,2,21,13,45,32,9,18,45,39]})
I tried this:
test_df.groupby('ID')['Feature_1'].transform(lambda x: moving_average_mit_anfang(x,1))
And got this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-6cc3e6c9b134> in <module>
----> 1 test_df.groupby('ID')['Feature_1'].transform(lambda x: moving_average_mit_anfang(x,1))
~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
505
506 if not isinstance(func, str):
--> 507 return self._transform_general(func, *args, **kwargs)
508
509 elif func not in base.transform_kernel_allowlist:
~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
535 res = res._values
536
--> 537 results.append(klass(res, index=group.index))
538
539 # check for empty "results" to avoid concat ValueError
~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
346 try:
347 if len(index) != len(data):
--> 348 raise ValueError(
349 f"Length of passed values is {len(data)}, "
350 f"index implies {len(index)}."
ValueError: Length of passed values is 6, index implies 4.
Output should be like:
ID Feature_1 Custom average
0 1 4 4.0
1 2 5 5.0
2 3 6 6.0
3 2 73 39.0
4 3 2 4.0
5 1 21 12.5
6 2 13 43.0
7 1 45 33.0
8 3 32 4.0
9 3 9 20.5
10 3 18 13.5
11 2 45 29.0
12 1 39 42.0
Your new solution working, also is possible omit lambda function for simplier solution (with lambda working too):
test_df['Custom average'] = test_df.groupby('ID')['Feature_1'].transform(moving_average_mit_anfang,2)
print (test_df)
ID Feature_1 Custom average
0 1 4 4.0
1 2 5 5.0
2 3 6 6.0
3 2 73 39.0
4 3 2 4.0
5 1 21 12.5
6 2 13 43.0
7 1 45 33.0
8 3 32 17.0
9 3 9 20.5
10 3 18 13.5
11 2 45 29.0
12 1 39 42.0