Tags: python, python-3.x, pandas, scikit-learn, sklearn-pandas

Sklearn KNN Imputer is missing some values


I was trying to impute a column containing NaNs with the KNN imputer from scikit-learn. Things seemed to work, but I realized that some NaNs remain in the imputed column. What could be the reason? I counted the NaNs before and after imputation.

Note: I've updated the code with the cleaning code I used before the imputation.

Input:

# Create a column combining the artist name and the track name
train.insert(2,'Artist Track',(train['Artist Name']+ " " + train['Track Name']))

# Remove duplicates for same Artist, Song, and Class
# Sort values by Artist Track then columns with NaNs to possibly drop duplicates with NaNs
train.sort_values(by=['Artist Track','Popularity','key','instrumentalness'], inplace=True)
train.drop_duplicates(subset=['Artist Track', 'Class'], keep='first', inplace=True)

# Remove duplicates of tracks if instrumentalness duplicate is NaN
train.sort_values(by=['Artist Track','instrumentalness'], inplace=True)
dups_ins = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
ins_nans = np.where(train['instrumentalness'].isna())[0]
drop_ins = set(dups_ins).intersection(ins_nans)
train.drop(drop_ins, inplace=True)

# Remove duplicates of tracks if key duplicate is NaN
train.sort_values(by=['Artist Track','key'], inplace=True)
dups_key = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
key_nans = np.where(train['key'].isna())[0]
drop_key = set(dups_key).intersection(key_nans)
train.drop(drop_key, inplace=True)


# Remove duplicates of tracks if popularity duplicate is NaN
train.sort_values(by=['Artist Track','Popularity'], inplace=True)
dups_pop = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
pop_nans = np.where(train['Popularity'].isna())[0]
drop_pop = set(dups_pop).intersection(pop_nans)
train.drop(drop_pop, inplace=True)

train['instrumentalness'].isna().sum()

Output:

3452

Input:

from sklearn.impute import KNNImputer 
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(train[['instrumentalness']])
train['instrumentalness'] = pd.DataFrame(values)
train['instrumentalness'].isna().sum()

Output:

472

Solution

  • First remark: you are fitting the KNN imputer on the column by itself:

    values = fea_transformer.fit_transform(train[['instrumentalness']])

    This wastes the information in the other features; using all of them would give a better imputation.
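To see what a single-column fit does in practice, here is a small sketch on made-up data (the array below is not from the question). With only one feature, a row whose value is missing has no usable distance to any other row, and as far as I can tell from the scikit-learn documentation KNNImputer then falls back to the mean of the observed values:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy single-column data, made up for illustration: one missing value.
X = np.array([[1.0], [2.0], [np.nan], [6.0]])

imputer = KNNImputer(n_neighbors=3)
filled = imputer.fit_transform(X)

# The missing entry becomes the mean of the observed values: (1 + 2 + 6) / 3.
print(filled.ravel())
```

So on one column the "nearest neighbours" never come into play, and the result is effectively mean imputation.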

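    Second remark: your problem is not with KNNImputer but with how you assign the values back to your DataFrame. pd.DataFrame(values) gets a fresh 0..n-1 index, while train still carries the original row labels left over from the sorting and dropping. Assignment aligns on index labels, so every row of train whose label is absent from the new index becomes NaN, and the rows that do match receive values from the wrong positions. If you check values directly:

    from numpy import isnan
    print(sum(isnan(values).flatten()))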
    
    0
    

    You'll see that values actually has no missing values.
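The alignment effect can be reproduced without any imputer. The following is a minimal sketch on a made-up three-row frame, where the `values` array only stands in for the imputer output:

```python
import numpy as np
import pandas as pd

# Toy frame (made up): non-default index labels, as left behind by sorting and dropping rows.
train = pd.DataFrame({"instrumentalness": [0.1, np.nan, 0.3]}, index=[5, 0, 9])

# Stand-in for the imputer output: a plain array with no missing values.
values = np.array([[0.1], [0.2], [0.3]])

# A freshly built frame has labels 0, 1, 2; only label 0 exists in train.
misaligned = pd.DataFrame(values)[0]
bad = train.assign(instrumentalness=misaligned)
n_nan_misaligned = bad["instrumentalness"].isna().sum()

# Reusing the original index keeps every value on its own row.
aligned = pd.DataFrame(values, columns=["instrumentalness"], index=train.index)
good = train.assign(instrumentalness=aligned["instrumentalness"])
n_nan_aligned = good["instrumentalness"].isna().sum()

print(n_nan_misaligned, n_nan_aligned)
```

In the misaligned case the one row that does match (label 0) also receives the value computed for a different row, so the leftover NaNs are only the visible part of the problem.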

    Here is a full working version:

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    fea_transformer = KNNImputer(n_neighbors=3)
    # Remark 1: better to use all numeric data and scale it
    scaler = StandardScaler()
    pipe = make_pipeline(scaler, fea_transformer)
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    train_features = train.select_dtypes(include=numerics)
    # Note: the pipeline output is in standardized units, not the original scale
    values = pipe.fit_transform(train_features)
    # Remark 2: build the frame from the imputed array and reuse the index of the train data
    imputed_df = pd.DataFrame(values, columns=train_features.columns, index=train_features.index)
    
    train['instrumentalness'] = imputed_df['instrumentalness']
    print(train['instrumentalness'].isna().sum())
    
    0
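One caveat with scaling inside the pipeline is that the imputed array comes back standardized. If the column should stay in its original units, one option is to undo the scaling with the fitted scaler step before assigning. This is only a sketch on made-up data: the `energy` column and all numbers are hypothetical, and `n_neighbors=1` is chosen just to keep the toy result easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two numeric features, one missing value, non-default index.
train = pd.DataFrame(
    {"instrumentalness": [0.10, np.nan, 0.30, 0.90], "energy": [1.0, 1.1, 5.0, 9.0]},
    index=[7, 3, 8, 1],
)

pipe = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=1))
scaled = pipe.fit_transform(train)

# Undo the standardization with the fitted scaler step of the pipeline.
restored = pipe.named_steps["standardscaler"].inverse_transform(scaled)
imputed_df = pd.DataFrame(restored, columns=train.columns, index=train.index)

print(imputed_df)
```

Here the row labelled 3 borrows its instrumentalness from the row with the closest energy value, and the observed values come back unchanged up to floating-point error.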