Search code examples
pythonpandasconditional-statementspandas-groupbyfillna

Python fillna based on a condition


I have the following dataframe grouped by datafile and I want to fillna(method ='bfill') only for those 'groups' that contain more than half of the data.

df.groupby('datafile').count()

datafile   column1   column2 column3 column4

datafile1     5         5       3       4         
datafile2     5         5       4       5
datafile3     5         5       5       5 
datafile4     5         5       0       0     
datafile5     5         5       1       1

As you can see in the df above, I'd like to fill those groups that contain most of the information but not those who has none or little information. So I was thinking in a condition something like fillna those who have more than half of the counts and don't fill the rest or those with less than half.

I'm struggling on how to set up my condition since it involves working with a result of a groupby and the original df.

Help is appreciated it.

example df:

index        datafile    column1   column2 column3 column4

0            datafile1      5       5        NaN      20         
1            datafile1      6       6        NaN      21         
2            datafile1      7       7        9        NaN         
3            datafile1      8       8        10       23         
4            datafile1      9       9        11       24         
5            datafile2      3       3        2        7         
6            datafile2      4       4        3        8         
7            datafile2      5       5        4        9         
8            datafile2      6       6        NaN      10         
9            datafile2      7       7        6        11         
10           datafile3      10      10       24       4         
11           datafile3      11      11       25       5         
12           datafile3      12      12       26       6         
13           datafile3      13      13       27       7         
14           datafile3      14      14       28       8         
15           datafile4      4       4        NaN      NaN         
16           datafile4      5       5        NaN      NaN         
17           datafile4      6       6        NaN      NaN         
18           datafile4      7       7        NaN      NaN         
19           datafile4      8       8        NaN      NaN         
19           datafile4      9       9        NaN      NaN         
20           datafile5      7       7        1        3  
21           datafile5      8       8        NaN      NaN         
22           datafile5      9       9        NaN      NaN         
23           datafile5      10      10       NaN      NaN         
24           datafile5      11      1        NaN      NaN                

expected output df:

index        datafile    column1   column2 column3 column4

0            datafile1      5       5        9        20         
1            datafile1      6       6        9        21         
2            datafile1      7       7        9        23         
3            datafile1      8       8        10       23         
4            datafile1      9       9        11       24         
5            datafile2      3       3        2        7         
6            datafile2      4       4        3        8         
7            datafile2      5       5        4        9         
8            datafile2      6       6        6        10         
9            datafile2      7       7        6        11         
10           datafile3      10      10       24       4         
11           datafile3      11      11       25       5         
12           datafile3      12      12       26       6         
13           datafile3      13      13       27       7         
14           datafile3      14      14       28       8         
15           datafile4      4       4        NaN      NaN         
16           datafile4      5       5        NaN      NaN         
17           datafile4      6       6        NaN      NaN         
18           datafile4      7       7        NaN      NaN         
19           datafile4      8       8        NaN      NaN         
19           datafile4      9       9        NaN      NaN         
20           datafile5      7       7        1        3  
21           datafile5      8       8        NaN      NaN         
22           datafile5      9       9        NaN      NaN         
23           datafile5      10      10       NaN      NaN         
24           datafile5      11      1        NaN      NaN 

Solution

  • if the proportion of NON-null values ​​is greater than or equal to 0.5 in each column then it is filled with the bfill method:

    rate = 0.5
    not_na = df.notna()
    g = not_na.groupby(df['datafile'])
    df_fill = (
        df.bfill()         
        .where(
            g.transform('sum')
            .div(g['datafile'].transform('size'), axis=0)
            .ge(rate) |
            not_na
        )
    )
    print(df_fill)
    

        index   datafile  column1  column2  column3  column4
    0       0  datafile1        5        5      9.0     20.0
    1       1  datafile1        6        6      9.0     21.0
    2       2  datafile1        7        7      9.0     23.0
    3       3  datafile1        8        8     10.0     23.0
    4       4  datafile1        9        9     11.0     24.0
    5       5  datafile2        3        3      2.0      7.0
    6       6  datafile2        4        4      3.0      8.0
    7       7  datafile2        5        5      4.0      9.0
    8       8  datafile2        6        6      6.0     10.0
    9       9  datafile2        7        7      6.0     11.0
    10     10  datafile3       10       10     24.0      4.0
    11     11  datafile3       11       11     25.0      5.0
    12     12  datafile3       12       12     26.0      6.0
    13     13  datafile3       13       13     27.0      7.0
    14     14  datafile3       14       14     28.0      8.0
    15     15  datafile4        4        4      NaN      NaN
    16     16  datafile4        5        5      NaN      NaN
    17     17  datafile4        6        6      NaN      NaN
    18     18  datafile4        7        7      NaN      NaN
    19     19  datafile4        8        8      NaN      NaN
    20     19  datafile4        9        9      NaN      NaN
    21     20  datafile5        7        7      1.0      3.0
    22     21  datafile5        8        8      NaN      NaN
    23     22  datafile5        9        9      NaN      NaN
    24     23  datafile5       10       10      NaN      NaN
    25     24  datafile5       11        1      NaN      NaN
    

    Also we can use:

    m = (not_na.groupby(df['datafile'], sort=False)
               .sum()
               .div(df['datafile'].value_counts(), axis=0)
               .ge(rate)
               .reindex(df['datafile']).reset_index(drop=True))
    df.bfill().where(m | not_na)
    

    both methods have similar returns for the sample dataframe

    %%timeit
    rate = 0.5
    not_na = df.notna()
    m = (not_na.groupby(df['datafile'], sort=False)
               .sum()
               .div(df['datafile'].value_counts(),axis=0)
               .ge(rate)
               .reindex(df['datafile']).reset_index(drop=True))
    df.bfill().where(m | not_na)
    11.1 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    %%timeit
    rate = 0.5
    not_na = df.notna()
    g = not_na.groupby(df['datafile'])
    df_fill = (df.bfill()
                 .where(g.transform('sum').div(g['datafile'].transform('size'),
                                               axis=0).ge(rate) |
                          not_na)
                )
    12.9 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)