python, pandas, scikit-learn, missing-data, mode

Replace Missing Values with Most Frequent number under Condition


I'm trying to replace the missing values in the "Age" column, conditioned on the values of other columns, in the Titanic - Machine Learning from Disaster dataset:

df.Age[(df['Sex'] == 0) & (df['Pclass'] == 1)]

I tried to do that using SimpleImputer:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
Imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

Imputer.fit_transform( pd.DataFrame(df.Age[(df['Sex'] == 0) & (df['Pclass'] == 1)]) )

but it doesn't work. I then tried to save the values back to the column:

df.loc[(df.Age.isnull()) & (df.Age[(df['Sex'] == 0) & (df['Pclass'] == 1)]), 'Age'] = Imputer.fit_transform( pd.DataFrame(df.Age[(df['Sex'] == 0) & (df['Pclass'] == 1)]) )

but that doesn't work either.
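
Two things likely go wrong in the SimpleImputer attempts. The first call only returns a NumPy array and never writes it back to df. In the second, the row mask combines df.Age.isnull() with the Age values themselves rather than with a boolean condition, and the right-hand side holds one value per row of the group, not one per missing row, so the two sides cannot line up. A minimal sketch on made-up toy data (the values and the names group and filled are illustrative, not from the original post), using the same mask on both sides:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for the Titanic frame (values are made up).
df = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1],
    "Pclass": [1, 1, 1, 1, 1, 1],
    "Age":    [30.0, 30.0, 40.0, np.nan, 22.0, np.nan],
})

group = (df["Sex"] == 0) & (df["Pclass"] == 1)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform expects 2-D input and returns a 2-D NumPy array with one
# row per row of the group (not only the rows that were missing).
filled = imputer.fit_transform(df.loc[group, ["Age"]])

# Use the same group mask on the left so both sides have the same length.
df.loc[group, "Age"] = filled[:, 0]
```

Rows outside the group (here the Sex == 1 rows) are left untouched, so their missing ages stay missing until their own group is processed.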

I also tried to do it manually using fillna():

df.loc[(df['Sex'] == 0) & (df['Pclass'] == 1), 'Age'].fillna(int(df.Age[(df['Sex'] == 0) & (df['Pclass'] == 1)].mode()), inplace=True)
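
A likely reason this fillna() call has no visible effect: df.loc[mask, 'Age'] returns a new Series, so inplace=True fills that temporary object rather than df. A minimal sketch on made-up toy data (the values and the names mask and group_mode are illustrative), assigning the filled values back through .loc:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
df = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1],
    "Pclass": [1, 1, 1, 1, 1, 1],
    "Age":    [30.0, 30.0, 40.0, np.nan, 22.0, np.nan],
})

mask = (df["Sex"] == 0) & (df["Pclass"] == 1)

# mode() returns a Series (ties give several values); take the first one.
group_mode = df.loc[mask, "Age"].mode().iloc[0]

# Assign the filled values back instead of relying on inplace=True.
df.loc[mask, "Age"] = df.loc[mask, "Age"].fillna(group_mode)

remaining = int(df.loc[mask, "Age"].isnull().sum())
```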

I tried to use indexes to access rows and update their values:

mod = int(df.Age[(df['Sex'] == 0) & (df['Pclass'] == 1)].mode())
indices = df.loc[(df.Age.isnull()) & (df.Sex == 0) & (df.Pclass == 1), 'Age'].isnull().index
df.loc[indices, 'Age'] = mod
df[(df['Sex'] == 0) & (df['Pclass'] == 1)]['Age'].isnull().sum()

It worked and the output was 0, but when I try to apply it in a for loop it gives me an error:

for i in range(1,3):
    for j in range(1,4):    
        indices = df.loc[(df.Sex == i) & (df.Pclass == j), 'Age'].isnull().index
        mod = int(df.Age[(df['Sex'] == i) & (df['Pclass'] == j)].mode())
        df.loc[ind, 'Age'] = mod

I want to know what is wrong with the first two approaches, and why the third doesn't work in a loop.
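
Some probable causes of the loop error, assuming Sex is encoded as 0/1 as the earlier df['Sex'] == 0 filter suggests: range(1,3) yields 1 and 2, so Sex == 2 matches no rows, mode() returns an empty Series, and int() on it raises an error. The loop also assigns through ind although the variable it computes is indices, and .isnull().index returns the index of every row in the selection, not only the missing ones, so all ages in the group would be overwritten. A minimal sketch on made-up toy data (values and variable names are illustrative) that guards against these cases:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Age":    [30.0, 30.0, np.nan, 22.0, np.nan, 22.0],
})

# Iterate over the values that actually occur rather than hard-coded ranges.
for sex in df["Sex"].unique():
    for pclass in df["Pclass"].unique():
        group = (df["Sex"] == sex) & (df["Pclass"] == pclass)
        modes = df.loc[group, "Age"].mode()
        if modes.empty:
            # Empty group, or every age in it is missing: nothing to impute from.
            continue
        # Restrict the assignment to the missing rows of this group only.
        df.loc[group & df["Age"].isnull(), "Age"] = modes.iloc[0]

missing_after = int(df["Age"].isnull().sum())
```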


Solution

  • This solution works well, but I don't know why the attempts above don't work!

    Imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    for i in range(2):
        for j in range(1,4):
            ls = np.array(df.Age[((df.Sex==i) & (df.Pclass==j))]).reshape(-1,1)
            df.Age[((df.Sex==i) & (df.Pclass==j))] = Imputer.fit_transform(ls)[:,0]
    df.Age.isnull().sum()
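
As an alternative to looping, a per-group fill can be written with groupby and transform. This also avoids the chained assignment df.Age[mask] = ..., a pattern that can raise SettingWithCopyWarning and, as far as I know, does not update df when pandas Copy-on-Write is enabled. A minimal sketch on made-up toy data; fill_with_mode is a hypothetical helper name, not part of pandas:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [30.0, 30.0, np.nan, 22.0, np.nan, 22.0],
})

def fill_with_mode(ages: pd.Series) -> pd.Series:
    """Fill missing values of one group with that group's most frequent value."""
    modes = ages.mode()
    # If the whole group is missing there is no mode; return it unchanged.
    return ages.fillna(modes.iloc[0]) if not modes.empty else ages

# transform applies the helper per (Sex, Pclass) group and keeps the original index.
df["Age"] = df.groupby(["Sex", "Pclass"])["Age"].transform(fill_with_mode)

missing_after = int(df["Age"].isnull().sum())
```

When a group has several equally frequent ages, this sketch takes the first (smallest) value returned by mode().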