Tags: python, pandas, kaggle

How to replace NaN values where the other columns meet a certain criteria?


I am working on the Titanic dataset from Kaggle and am trying to replace the NaN values in one column based on information from the other columns.

In my specific example I am trying to replace the unknown age of male, 1st class passengers with the average age of male, 1st class passengers.

How do I do this?

I have been able to segment the data and replace the null values of that new dataframe, but it doesn't carry over to the original dataframe and I am a bit unclear on how to make it do so.

Here is my code:

missingage_1stclass_male = pd.DataFrame(
    titanic[
        (titanic['Age'].isnull()) &
        (titanic['Pclass'] == 1) &
        (titanic['Sex'] == 'male')
    ]
)
missingage_1stclass_male.Age.fillna(40.5, inplace=True)

My original dataframe with all the values is named titanic.


Solution

  • I am trying to replace the unknown age of male, 1st class passengers with the average age of male, 1st class passengers.

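    Your attempt does not carry over because boolean indexing returns a new dataframe, so fillna only changes that copy and never touches the original. Instead, work on the original dataframe directly (called df in the code that follows; in your case that is titanic) and split the problem into 2 steps. First calculate the average age of male, 1st class passengers: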

    mask = (df['Pclass'] == 1) & (df['Sex'] == 'male')
    avg_filler = df.loc[mask, 'Age'].mean()
    

    Then update values satisfying your criteria:

    df.loc[df['Age'].isnull() & mask, 'Age'] = avg_filler
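To see the two steps end to end, here is a minimal sketch on a small made-up dataframe. The six rows are invented for illustration and are not real Titanic records; only the column names Pclass, Sex and Age follow the question. The last two statements show one possible generalization that is not part of the answer above: filling every missing age with the mean of its own (Pclass, Sex) group via groupby and transform.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values invented for illustration).
titanic = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3],
    "Sex": ["male", "male", "male", "female", "male", "male"],
    "Age": [30.0, 50.0, np.nan, np.nan, 20.0, np.nan],
})

# Step 1: mean age of male, 1st class passengers (mean() skips NaN).
mask = (titanic["Pclass"] == 1) & (titanic["Sex"] == "male")
avg_filler = titanic.loc[mask, "Age"].mean()

# Step 2: assign through .loc on the original frame so the change persists.
titanic.loc[titanic["Age"].isnull() & mask, "Age"] = avg_filler

print(titanic)

# Possible generalization (an assumption about what you may want next):
# fill each remaining NaN with the mean age of its own (Pclass, Sex) group.
group_mean = titanic.groupby(["Pclass", "Sex"])["Age"].transform("mean")
filled_all = titanic["Age"].fillna(group_mean)
```

Only the male, 1st class row with a missing age is changed by step 2; the female 1st class row and the 3rd class row keep their NaN. In the generalized version, a group with no known ages at all has a NaN mean, so its missing values stay missing.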