python pandas dataframe missing-data

Fill null values in a column of a pandas DataFrame


I have a pandas DataFrame with more than 4 columns. Some values in col1 are missing, and I want to fill them using the following fallback approach:

  1. try to set it to the average of col1 over the records that have the same col2, col3, col4 values
  2. if there is no such record, set it to the average of col1 over the records that have the same col2, col3 values
  3. if there is still no such record, set it to the average of col1 over the records that have the same col2 value
  4. if none of the above can be found, set it to the average of all non-missing values in col1

What's the best way to do this?


Solution

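  • Based on your logic, you can chain fillna calls as follows, where each fillna line corresponds to a numbered step in your question, in the same order. If a group has no non-missing col1 values, its mean is NaN, so the value stays missing and falls through to the next, coarser step: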

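    # Every group mean is computed from the original col1, before any filling.
    df['col1'] = (df['col1']
                   # 1. mean within the same col2, col3, col4
                   .fillna(df.groupby(['col2','col3','col4'])['col1'].transform('mean'))
                   # 2. mean within the same col2, col3
                   .fillna(df.groupby(['col2','col3'])['col1'].transform('mean'))
                   # 3. mean within the same col2
                   .fillna(df.groupby(['col2'])['col1'].transform('mean'))
                   # 4. overall mean of the non-missing values
                   .fillna(df['col1'].mean())
                 )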
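
    To check that the fallback order behaves as described, here is a minimal sketch on a toy frame. The helper name fill_by_group_means and the sample data are made up for illustration; they are not from the question. The helper wraps the same chained-fillna idea in a loop over key lists, ordered from most to least specific.

```python
import numpy as np
import pandas as pd


def fill_by_group_means(df, target, key_levels):
    """Fill NaNs in `target` with group means, trying each key list in order.

    Hypothetical helper for illustration: `key_levels` is a list of
    column-name lists, ordered from most specific to least specific.
    """
    filled = df[target]
    for keys in key_levels:
        # Group means come from the original column, so earlier fills
        # never leak into later averages. All-NaN groups yield NaN and
        # therefore fall through to the next level.
        filled = filled.fillna(df.groupby(keys)[target].transform("mean"))
    # Last resort: overall mean of the non-missing values.
    return filled.fillna(df[target].mean())


# Toy data (assumed): one missing value for each fallback level.
df = pd.DataFrame({
    "col1": [1.0, 3.0, np.nan, 10.0, np.nan, 20.0, np.nan, np.nan],
    "col2": ["a", "a", "a", "a", "a", "b", "b", "c"],
    "col3": ["x", "x", "x", "y", "y", "z", "w", "q"],
    "col4": ["p", "p", "p", "p", "q", "r", "r", "s"],
})

result = fill_by_group_means(
    df, "col1", [["col2", "col3", "col4"], ["col2", "col3"], ["col2"]]
)
print(result.tolist())
# → [1.0, 3.0, 2.0, 10.0, 10.0, 20.0, 20.0, 8.5]
```

    Row 2 is filled from its exact (col2, col3, col4) group, row 4 from its (col2, col3) group, row 6 from its col2 group, and row 7 from the overall mean (1 + 3 + 10 + 20) / 4 = 8.5.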