I have a dataframe called train containing Titanic disaster data. Among the columns are 'Pclass', denoting the passenger's class (there are three classes: 1, 2 and 3), and 'Age'. Not all the ages are known, and I want to fill the NaN values in 'Age' with the mean age, using a different mean for each class.
The code is as follows:
for i in np.arange(1,4):
    obj=train[train['Pclass']==i]['Age'].mean()
    train[train['Pclass']==i]['Age'].fillna(value=obj,inplace=True)
But when I inspect the dataframe, the NaN values are still there. Can somebody explain why?
On my system that gives a SettingWithCopyWarning, with a link to the caveats section in the pandas documentation on returning a view versus a copy. The specifics are quite involved, and several parts of the explanation there are worth reading.
The recommended indexing method is to use loc with a boolean mask and a single fixed column label, as in:
nans = train['Age'].isna()  # find all the relevant nans, once
for i in range(1,4):
    mask = train['Pclass'] == i
    # incorporate known nan locations to perform a single __setitem__ on a loc
    train.loc[mask & nans, 'Age'] = train.loc[mask, 'Age'].mean()
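To make the effect easy to check by hand, here is a minimal self-contained sketch of the same loop. The small DataFrame is a made-up stand-in for your Titanic data (only the two relevant columns, with invented values), so treat the numbers as illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data: invented values, two columns only.
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 30.0, np.nan, 20.0, 24.0, np.nan],
})

nans = train["Age"].isna()  # locate the missing ages once, before any filling
for i in range(1, 4):
    mask = train["Pclass"] == i
    # mean() skips NaN by default, so this is the mean of the known ages in class i;
    # the assignment is a single __setitem__ on train.loc
    train.loc[mask & nans, "Age"] = train.loc[mask, "Age"].mean()

print(train)
```

With these made-up values, the class means of the known ages are 45, 30 and 22, so each missing age should be replaced by the mean of its own class.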
This works because it's a single __setitem__ call (foo[item] = bar) on train.loc, which is guaranteed to operate on the original DataFrame itself. In comparison, calling fillna on the result of a __getitem__ call (foo[item].fillna(...)) may mean that fillna operates on a copy of a slice rather than on a view of the original DataFrame, which appears to be the case here. The inplace parameter in fillna does what it's supposed to, but because it modifies a temporary copy that is discarded immediately, the results never reach train.
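The copy behaviour can be observed directly by keeping a reference to the intermediate object instead of letting it be thrown away. This is a rough sketch on made-up data (the names df and ages_class1 are mine, not from your code), and warnings are silenced only to keep the output clean.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Age": [40.0, np.nan, 30.0, np.nan],
})

mask = df["Pclass"] == 1

# Same chained __getitem__ as in the question, but the result is kept in a
# variable so it can be inspected after fillna runs.
ages_class1 = df[mask]["Age"]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide any copy-related warnings in this demo
    ages_class1.fillna(value=40.0, inplace=True)

print(ages_class1.isna().sum())  # 0: the intermediate object was filled in place
print(df["Age"].isna().sum())    # 2: the original DataFrame is untouched
```

In your loop the intermediate object is never bound to a name, so the filled copy is simply lost.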
From the docs I linked to:
Outside of simple cases, it’s very hard to predict whether [__getitem__] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify [the original DataFrame] or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!
As a minor bonus, using loc and reusing mask here can be more efficient than the chained indexing you started with.