Search code examples
pythonpandaslistdataframefillna

Pandas: Forward fill with overlap on Series containing List Objects


I have a Series/DataFrame such as this one. The elements contained in them are lists with one or more values:

0      NaN
1     [40]
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9     [35]
10     NaN
11     NaN
12    [28]
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
Name: tags, dtype: object

I would like to fill in the missing values with the latest value upto five consecutive entries. A ffill with limit 5 would be most appropriate. However my use case is such that I want the forward fill to overlap. My expected output would look something like this:

0          NaN
1         [40]
2         [40]
3         [40]
4         [40]
5         [40]
6         [40]
7          NaN
8          NaN
9         [35]
10        [35]
11        [35]
12        [28]
13    [35, 28]
14    [35, 28]
15        [28]
16        [28]
17        [28]
Name: tags, dtype: object

The example above is for the sake of simplicity, and this function I am describing is to be part of a much larger pd.groupby operation with several more tags and as such python loops aren't much help. I don't care about the indices with the tags themselves, only those that are being filled are important to me. Maybe an approach with pandas cumsum and slicing out based on index difference would work here?

Any ideas to approach this problem would prove incredibly helpful to me. Thanks in advance!


Solution

  • You could try:

    # fill na by empty list 
    df['tags'] = [[] if na else s for s, na in zip(df['tags'], df['tags'].isna())]
    
    # compute rolling windows
    df['res'] = [[l for ls in window for l in ls] for window in df['tags'].rolling(5)]
    print(df)
    

    Output

        tags       res
    0     []        []
    1   [40]      [40]
    2     []      [40]
    3     []      [40]
    4     []      [40]
    5     []      [40]
    6     []        []
    7     []        []
    8     []        []
    9   [35]      [35]
    10    []      [35]
    11    []      [35]
    12  [28]  [35, 28]
    13    []  [35, 28]
    14    []      [28]
    15    []      [28]
    16    []      [28]
    17    []        []
    

    As an alternative, you could use chain.from_iterable:

    from itertools import chain
    
    # compute rolling windows
    df['res'] = [list(chain.from_iterable(window)) for window in df['tags'].rolling(5)]
    

    See this answer for a comparison on several methods for list flattening in pandas.