I was trying to impute a column containing NaNs using KNNImputer from scikit-learn. Everything seemed to work, but I noticed that the imputed column still contains some NaNs. What could be the reason? I counted the NaNs before and after imputation.
Note: I've updated the code with the cleaning code I used before the imputation.
Input:
# Create a column combining the artist name and the track name
train.insert(2,'Artist Track',(train['Artist Name']+ " " + train['Track Name']))
# Remove duplicates for same Artist, Song, and Class
# Sort values by Artist Track then columns with NaNs to possibly drop duplicates with NaNs
train.sort_values(by=['Artist Track','Popularity','key','instrumentalness'], inplace=True)
train.drop_duplicates(subset=['Artist Track', 'Class'], keep='first', inplace=True)
# Remove duplicates of tracks if instrumentalness duplicate is NaN
train.sort_values(by=['Artist Track','instrumentalness'], inplace=True)
dups_ins = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
ins_nans = np.where(train['instrumentalness'].isna())[0]
drop_ins = set(dups_ins).intersection(ins_nans)
train.drop(drop_ins, inplace=True)
# Remove duplicates of tracks if key duplicate is NaN
train.sort_values(by=['Artist Track','key'], inplace=True)
dups_key = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
key_nans = np.where(train['key'].isna())[0]
drop_key = set(dups_key).intersection(key_nans)
train.drop(drop_key, inplace=True)
# Remove duplicates of tracks if popularity duplicate is NaN
train.sort_values(by=['Artist Track','Popularity'], inplace=True)
dups_pop = train[train.duplicated(subset=['Artist Track'], keep='first')==True].index
pop_nans = np.where(train['Popularity'].isna())[0]
drop_pop = set(dups_pop).intersection(pop_nans)
train.drop(drop_pop, inplace=True)
train['instrumentalness'].isna().sum()
Output:
3452
Input:
from sklearn.impute import KNNImputer
fea_transformer = KNNImputer(n_neighbors=3)
values = fea_transformer.fit_transform(train[['instrumentalness']])
train['instrumentalness'] = pd.DataFrame(values)
train['instrumentalness'].isna().sum()
Output:
472
First remark: you are fitting the KNN imputer on the column by itself:
values = fea_transformer.fit_transform(train[['instrumentalness']])
This throws away the information in all the other features. Using them as well would give a better imputation.
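To illustrate why a single column gives the imputer nothing to work with, here is a minimal sketch on a hypothetical toy array (not your data). As far as I understand the scikit-learn documentation, when a row has no observed feature to compute distances on, KNNImputer falls back to the training average of that column, so in this setup it behaves like mean imputation:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy column standing in for train[['instrumentalness']]
X = np.array([[1.0], [np.nan], [3.0], [8.0]])

imputed = KNNImputer(n_neighbors=3).fit_transform(X)

# Mean of the observed values, for comparison with the filled entry
col_mean = np.nanmean(X[:, 0])
filled_value = imputed[1, 0]

print(filled_value, col_mean)
```

With more than one feature, the missing row can be compared to its neighbours on the features that are present, which is what the imputer is designed for.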
Second remark: your problem is not with KNNImputer but with how you assign values to your DataFrame. When you wrap the result in its own DataFrame, it gets a fresh default index (0, 1, 2, ...) that is not aligned with the original one, which is no longer contiguous after the sorting and dropping you did; hence the new NaNs. You can check this from your code:
from numpy import isnan
print(sum(isnan(values).flatten()))
0
You'll see that values actually has no missing values.
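The alignment effect can be reproduced on a small hypothetical frame (the column name `x` and the index labels are made up for illustration). This sketch uses a Series rather than a DataFrame to keep it short; the label-based alignment on assignment is the same:

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as after drop() / drop_duplicates()
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan]}, index=[0, 2, 5, 7])

# Stand-in for the imputer output: a plain array without NaNs
values = np.array([1.0, 2.0, 3.0, 2.0])

# A new Series gets the default index 0..3; assignment aligns on labels,
# so only labels 0 and 2 find a match and labels 5 and 7 become NaN
misaligned = df.copy()
misaligned["x"] = pd.Series(values)
n_bad = int(misaligned["x"].isna().sum())

# Passing the original index keeps every row in place
aligned = df.copy()
aligned["x"] = pd.Series(values, index=df.index)
n_good = int(aligned["x"].isna().sum())

print(n_bad, n_good)
```

Note that the misaligned version is wrong even where it is not NaN: the row labelled 2 receives `values[2]` instead of `values[1]`, so matching labels can silently pick up another row's value.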
Here is a full working version:
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
fea_transformer = KNNImputer(n_neighbors=3)
# Remark 1 : better to use all numeric data and scale it
scaler = StandardScaler()
pipe = make_pipeline(scaler, fea_transformer)
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
train_features = train.select_dtypes(include=numerics)
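values = pipe.fit_transform(train_features)
# The pipeline output is standardized, so undo the scaling to get the original units back
values = pipe.named_steps['standardscaler'].inverse_transform(values)
# Remark 2 : make sure imputed data has the same index as the train data
imputed_df = pd.DataFrame(values, columns=train_features.columns, index=train_features.index)
train['instrumentalness'] = imputed_df['instrumentalness']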
print(train['instrumentalness'].isna().sum())
0