Search code examples
pythonpandasdataframeseriesfillna

how to fill NaNs based on repeating id?


I have one basic df - a monthly report of loans , it's 1700 contract ids of loans given per some date(monthly) e.g. report per 01.01.2019:

contract_id    loan_sum   expiry_days_01_01_2019
  1            1000       20
  2            2300       3
  3            500        17
  4            1400       6
  5            890        5

another for 01.02.2019

 contract_id loan_sum   expiry_days_01_02_2019
      1      7000       20
      2      3000       39
      3      500        130
      4      9400       89
      5      909        7

etc.

I have 5 reports with same number of ids. I have combined them under another.

Now i have many nan values for expiry-date column, like in a matrix with diagonal filled properly with ones.

 id  contract_id     loan_sum   expiry_days_01_01_2019 expiry_01_02_2019
   0           1      7000       20                      nan
   1           2      3000       39                      nan
   2           3      500        130                     nan
   3           4      9400       89                      nan
   4           5      909        7                       nan
   5           1      7000       nan                     20
   6           2      3000       nan                     39
   7           3      500        nan                     130
   8           4      9400       nan                     89
   9           5      909        nan                     7
   ...           ...                     ...

When the contract id repeats it should pull the same date, but instead they become NaNs.

How to fill nan values based on same , repeating contract id?


Solution

  • you could use groupby.transform with first. if the altogether dataframe is called dfm, then:

    # I assumed you put the dataframes together like this
    dfm = pd.concat([df1, df2], axis=0)
    
    col_exp = dfm.filter(like='expiry').columns
    dfm[col_exp] = dfm.groupby('contract_id')[col_exp].transform('first')
    print (dfm)
       contract_id  loan_sum  expiry_days_01_01_2019  expiry_days_01_02_2019
    0            1      1000                    20.0                    20.0
    1            2      2300                     3.0                    39.0
    2            3       500                    17.0                   130.0
    3            4      1400                     6.0                    89.0
    4            5       890                     5.0                     7.0
    0            1      7000                    20.0                    20.0
    1            2      3000                     3.0                    39.0
    2            3       500                    17.0                   130.0
    3            4      9400                     6.0                    89.0
    4            5       909                     5.0                     7.0
    

    But I think changing the way the altogether dataframe is build could be another solution and create multiindex columns, something like:

    list_dfs = [df1, df2]
    dfm = pd.concat([df_.set_index('contract_id')
                        .rename(columns=lambda x: x.split('_')[0]) 
                     for df_ in list_dfs], 
                     keys=[df_.filter(like='expiry').columns[0][-10:]  
                           for df_ in list_dfs], 
                     axis=1)
    print (dfm)
                01_01_2019        01_02_2019       
                      loan expiry       loan expiry
    contract_id                                    
    1                 1000     20       7000     20
    2                 2300      3       3000     39
    3                  500     17        500    130
    4                 1400      6       9400     89
    5                  890      5        909      7