Search code examples
pythonpandasfillna

What's wrong with fillna in Pandas running twice?


I'm new to Pandas and Numpy. I was trying to solve the Kaggle | Titanic Dataset. Now I have to fix the two columns, "Age" and "Embarked" because they contains NAN.

Now I tried the fillna without any success, soon to discover that I was missing the inplace = True.

Now I attached them. But the first imputation was successful but the second one was not. I tried searching in SO and google, but did not find anything useful. Please help me.

Here's the code that I was trying.

# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
    
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)

print titanic_df["Age"][titanic_df["Age"].isnull()].size
print titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size

and I got the output as

0
2

However I managed to get what I want without using inplace=True

titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())

But I am curious what's with the second usage of inplace=True.

Please bear with if I'm asking something which is extremely stupid because I' totally new and I may miss small things. Any help is appreciated. Thanks in advance.


Solution

  • pd.Series.mode returns a Series.

    A variable has a single arithmetic mean and a single median but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.

    pandas operates on labels.

    titanic_df.mean()
    Out: 
    PassengerId    446.000000
    Survived         0.383838
    Pclass           2.308642
    Age             29.699118
    SibSp            0.523008
    Parch            0.381594
    Fare            32.204208
    dtype: float64
    

    If I were to use titanic_df.fillna(titanic_df.mean()) it would return a new DataFrame where the column PassengerId is filled with 446.0, column Survived is filled with 0.38 and so on.

    However, if I call the mean method on a Series, the returning value is a float:

    titanic_df['Age'].mean()
    Out: 29.69911764705882
    

    There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()) all the missing values in all the columns will be filled with 29.699.

    Why the first attempt was not successful

    You tried to fill the entire DataFrame, titanic_df with titanic_df["Embarked"].mode(). Let's check the output first:

    titanic_df["Embarked"].mode()
    Out: 
    0    S
    dtype: object
    

    It is a Series with a single element. The index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value. Here, we only have one label. So it will only fill values if we have a column named 0. Try adding df[0] = np.nan and executing your code again. You'll see that the new column is filled with S.

    Why the second attempt was (not) successful

    The right hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode()) returns a new DataFrame. In this new DataFrame, Embarked column still has nan's:

    titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
    Out: 2
    

    However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.

    What you should do instead

    You are trying to fill a single column with a single value. First, disassociate that value from its label:

    titanic_df['Embarked'].mode()[0]
    Out: 'S'
    

    Now, it is not important if you use inplace=True or assign the result back. Both

    titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
    

    and

    titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
    

    will fill the missing values in the Embarked column with S.

    Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).