python dataframe fillna

fillna() used in a for loop doesn't affect the dataframe


I have a DataFrame called train that holds the Titanic disaster data. Its columns include 'Pclass', the passenger's class (one of three: 1, 2 or 3), and 'Age'. Not all ages are known, and I want to fill the NaN values in 'Age' with the mean age, using a separate mean for each class.

The code is as follows:

for i in np.arange(1,4):
     obj=train[train['Pclass']==i]['Age'].mean()
     train[train['Pclass']==i]['Age'].fillna(value=obj,inplace=True)

But when I display the DataFrame, the NaN values are still there. Can somebody explain why?


Solution

  • On my system that code gives a SettingWithCopyWarning, with a link to the caveats section ("Returning a view versus a copy") of the pandas indexing documentation. The details are fairly involved, and several parts of that section are worth reading to understand what is going on.

    The recommended approach is to index with loc, using a boolean mask for the rows and a single column label, as in:

    nans = train['Age'].isna() # find all the relevant nans, once
    for i in range(1,4):
        mask = train['Pclass'] == i
        # incorporate known nan locations to perform a single __setitem__ on a loc
        train.loc[mask & nans, 'Age'] = train.loc[mask, 'Age'].mean()
    

    This works because it's a single __setitem__ call (foo.loc[item] = bar) on the loc indexer, which pandas guarantees will operate on the original DataFrame. In comparison, calling fillna on the result of a __getitem__ call (foo[item].fillna(...)) may operate on a copy of a slice rather than on a view of the original DataFrame, which appears to be the case here. The inplace parameter in fillna does what it's supposed to, but because it modifies that temporary copy rather than the original, the result is thrown away and you never see it.

    From the docs I linked to,

    Outside of simple cases, it’s very hard to predict whether [__getitem__] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify [the original DataFrame] or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!

    As a minor bonus, using loc and reusing mask here can be more efficient than the chained indexing you started with.
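
To make the copy-versus-original difference concrete, here is a minimal sketch on toy data. The values and the make_train helper are made up for illustration and are not the real Titanic set. Part 1 repeats the chained indexing from the question, part 2 uses the loc assignment from above, and part 3 is a loop-free alternative based on groupby(...).transform('mean'), which is not part of the answer above but is a common way to get the same per-class fill.

```python
import warnings

import numpy as np
import pandas as pd


def make_train():
    """Hypothetical stand-in for the Titanic data (made-up values)."""
    return pd.DataFrame({
        "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
        "Age": [40.0, np.nan, 50.0, 30.0, np.nan, 20.0, 24.0, np.nan],
    })


# 1) Chained indexing, as in the question: fillna runs on a temporary object.
chained = make_train()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    for i in range(1, 4):
        obj = chained[chained["Pclass"] == i]["Age"].mean()
        chained[chained["Pclass"] == i]["Age"].fillna(value=obj, inplace=True)
nans_left_chained = int(chained["Age"].isna().sum())

# 2) One .loc assignment per class: writes into the original DataFrame.
fixed = make_train()
nans = fixed["Age"].isna()  # locate the missing ages once
for i in range(1, 4):
    mask = fixed["Pclass"] == i
    fixed.loc[mask & nans, "Age"] = fixed.loc[mask, "Age"].mean()
nans_left_loc = int(fixed["Age"].isna().sum())

# 3) Loop-free alternative: per-row class mean, then fill only the gaps.
alt = make_train()
class_means = alt.groupby("Pclass")["Age"].transform("mean")
alt["Age"] = alt["Age"].fillna(class_means)

print(nans_left_chained, nans_left_loc)
print(fixed["Age"].tolist())
```

On the toy data, the chained version should leave all three NaN values in place, while the loc version and the groupby version should both fill them with the class means (45, 30 and 22 here). Assigning the result of fillna back to the column, as in part 3, avoids inplace altogether, so there is no question of whether a view or a copy was modified.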