I have a Series/DataFrame such as this one. The elements contained in them are lists with one or more values:
0 NaN
1 [40]
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 [35]
10 NaN
11 NaN
12 [28]
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
Name: tags, dtype: object
I would like to fill in the missing values with the latest value upto five consecutive entries. A ffill with limit 5 would be most appropriate. However my use case is such that I want the forward fill to overlap. My expected output would look something like this:
0 NaN
1 [40]
2 [40]
3 [40]
4 [40]
5 [40]
6 [40]
7 NaN
8 NaN
9 [35]
10 [35]
11 [35]
12 [28]
13 [35, 28]
14 [35, 28]
15 [28]
16 [28]
17 [28]
Name: tags, dtype: object
The example above is for the sake of simplicity, and this function I am describing is to be part of a much larger pd.groupby
operation with several more tags and as such python loops aren't much help. I don't care about the indices with the tags themselves, only those that are being filled are important to me. Maybe an approach with pandas cumsum and slicing out based on index difference would work here?
Any ideas to approach this problem would prove incredibly helpful to me. Thanks in advance!
You could try:
# fill na by empty list
df['tags'] = [[] if na else s for s, na in zip(df['tags'], df['tags'].isna())]
# compute rolling windows
df['res'] = [[l for ls in window for l in ls] for window in df['tags'].rolling(5)]
print(df)
Output
tags res
0 [] []
1 [40] [40]
2 [] [40]
3 [] [40]
4 [] [40]
5 [] [40]
6 [] []
7 [] []
8 [] []
9 [35] [35]
10 [] [35]
11 [] [35]
12 [28] [35, 28]
13 [] [35, 28]
14 [] [28]
15 [] [28]
16 [] [28]
17 [] []
As an alternative, you could use chain.from_iterable:
from itertools import chain
# compute rolling windows
df['res'] = [list(chain.from_iterable(window)) for window in df['tags'].rolling(5)]
See this answer for a comparison on several methods for list flattening in pandas.