
pandas fillna by group for multiple columns


In a dataset like this one (CSV format), which has several value columns, how can I use fillna together with df.groupby("DateSent") to fill all desired columns with the group's min()/3?

In [5]: df.head()
Out[5]: 
  ID  DateAcquired  DateSent         data   value  measurement    values
0  1      20210518  20220110  6358.434713   556.0   317.869897  3.565781
1  1      20210719  20220210  6508.458382  1468.0   774.337509  5.565384
2  1      20210719  20220310  6508.466246     1.0    40.837533  1.278085
3  1      20200420  20220410  6507.664194    48.0    64.335047  1.604183
4  1      20210328  20220510  6508.451227     0.0    40.337486  1.270236

According to this other thread on SO, one way of doing it would be column by column:

df["data"]        = df.groupby("DateSent")["data"].transform(lambda x: x.fillna(x.min()/3))
df["value"]       = df.groupby("DateSent")["value"].transform(lambda x: x.fillna(x.min()/3))
df["measurement"] = df.groupby("DateSent")["measurement"].transform(lambda x: x.fillna(x.min()/3))
df["values"]      = df.groupby("DateSent")["values"].transform(lambda x: x.fillna(x.min()/3))

In my original dataset, which has 100000 such columns, I could technically loop over all the desired column names. But is there a better/faster way of doing this? Perhaps something already implemented in pandas?


Solution

  • One way you could do this is to collect all the columns you want to impute in a list - I will assume that you want all the numerical columns (except ID, DateAcquired, DateSent):

    fti = [i for i in df.iloc[:,3:].columns if df[i].dtypes != 'object'] # features to impute
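
    Equivalently, pandas can build that list for you with select_dtypes. A sketch of the same idea, assuming (as above) that the columns to skip are exactly the first three (ID, DateAcquired, DateSent):

    # same list of features to impute, built via select_dtypes instead of a manual dtype check
    fti = df.iloc[:, 3:].select_dtypes(include="number").columns.tolist()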
    

    Then, you can create a new df with only the imputed values:

    imputed = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min()/3))
    
    imputed.head(5)
              data   value  measurement    values
    0  6358.434713   556.0   317.869897  3.565781
    1  6508.458382  1468.0   774.337509  5.565384
    2  6508.466246     1.0    40.837533  1.278085
    3  6507.664194    48.0    64.335047  1.604183
    4  6508.451227     0.0    40.337486  1.270236
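
    Because groupby(...).transform(...) returns a frame aligned to the original index, imputed lines up row-for-row with df. A small illustrative check (not part of the original answer):

    # transform preserves the original index, so the frames align row-for-row
    assert imputed.index.equals(df.index)

    # cells that were NaN in df and are now filled in imputed
    filled = df[fti].isna() & imputed.notna()
    print(filled.sum())  # number of filled cells per column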
    

    Lastly, you can concat the untouched columns back with the imputed ones (a shorter in-place alternative is sketched after the output below):

    res = pd.concat([df[df.columns.symmetric_difference(imputed.columns)], imputed], axis=1)
    
    res.head(15)
    
        DateAcquired  DateSent ID         data   value  measurement    values
    0       20210518  20220110  1  6358.434713   556.0   317.869897  3.565781
    1       20210719  20220210  1  6508.458382  1468.0   774.337509  5.565384
    2       20210719  20220310  1  6508.466246     1.0    40.837533  1.278085
    3       20200420  20220410  1  6507.664194    48.0    64.335047  1.604183
    4       20210328  20220510  1  6508.451227     0.0    40.337486  1.270236
    5       20210518  20220610  1  6508.474031     3.0    15.000000  0.774597
    6       20210108  20220110  2  6508.402472   897.0   488.837335  4.421933
    7       20210110  20220210  2  6508.410493    52.0   111.000000  2.107131
    8       20210119  20220310  2  6508.419065   800.0   440.337387  4.196844
    9       20210108  20220410  2  6508.426063    89.0    84.837408  1.842144
    10      20200109  20220510  2  6507.647600   978.0   529.334996  4.601456
    11      20210919  20220610  2  6508.505563  1566.0   823.337655  5.738772
    12      20211214  20220612  2  6508.528918   152.0   500.000000  4.472136
    13      20210812  20220620  2  6508.497936   668.0   374.337631  3.869561
    14      20210909  20220630  2  6508.506350   489.0   284.837657  3.375427
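
    If you do not need a separate res frame, a shorter route (my own sketch, not from the original answer) is to write the imputed values straight back into df. This also keeps the original column order, whereas symmetric_difference sorts the non-imputed columns alphabetically (which is why res shows DateAcquired before ID):

    # overwrite only the columns being imputed; everything else stays untouched
    df[fti] = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min() / 3))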