I'm new to Pandas and Numpy. I was trying to solve the Kaggle | Titanic Dataset. Now I have to fix the two columns, "Age" and "Embarked" because they contains NAN.
Now I tried the fillna
without any success, soon to discover that I was missing the inplace = True
.
Now I attached them. But the first imputation was successful but the second one was not. I tried searching in SO and google, but did not find anything useful. Please help me.
Here's the code that I was trying.
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)
print titanic_df["Age"][titanic_df["Age"].isnull()].size
print titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size
and I got the output as
0
2
However I managed to get what I want without using inplace=True
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious what's with the second usage
of inplace=True
.
Please bear with if I'm asking something which is extremely stupid because I' totally new and I may miss small things. Any help is appreciated. Thanks in advance.
A variable has a single arithmetic mean and a single median but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.
titanic_df.mean()
Out:
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean())
it would return a new DataFrame where the column PassengerId
is filled with 446.0, column Survived
is filled with 0.38 and so on.
However, if I call the mean method on a Series, the returning value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean())
all the missing values in all the columns will be filled with 29.699.
You tried to fill the entire DataFrame, titanic_df
with titanic_df["Embarked"].mode()
. Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0 S
dtype: object
It is a Series with a single element. The index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean()
to fill: it would fill each column with the corresponding mean value. Here, we only have one label. So it will only fill values if we have a column named 0
. Try adding df[0] = np.nan
and executing your code again. You'll see that the new column is filled with S
.
The right hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode())
returns a new DataFrame. In this new DataFrame, Embarked
column still has nan
's:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']
. It didn't actually fill the missing values in the Embarked
column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now, it is not important if you use inplace=True
or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S
.
Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).