Instead of filling missing values with 0 or with the variable mean, I would like to fill them with the mean of the other similar observations in the dataset.
Example: A, B, C and D are each a single sample with several measured variables.
     V1   V2  V3
A   8.7  4.3   5
B   nan  2.5   3
C   0.1  2.5   3
D   1.5  2.5   3
Running K-Means clustering on variables V2 and V3 returns 2 clusters: one containing A, and a second one containing B, C and D. Because B belongs to the second cluster, I want to fill its missing value in V1 with the second cluster's mean for V1.
So the missing value for row B in V1 will be 0.8, because that is the mean of 0.1 and 1.5, the V1 values of C and D.
This is a very simple example, so I would like to know how to do this with Python for a large dataset.
Thanks for any help with code that can do this quickly and fill the missing values "automatically" in that way.
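Use KNNImputer from sklearn. It fills each missing value with the mean of that column over the n_neighbors nearest rows, where the distance is computed on the values that are present: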
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
data = {'V1': [8.7, np.nan, 0.1, 1.5],
        'V2': [4.3, 2.5, 2.5, 2.5],
        'V3': [5, 3, 3, 3]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
out = imputer.fit_transform(df)
out = pd.DataFrame(out, index=df.index, columns=df.columns)
>>> out
    V1   V2   V3
0  8.7  4.3  5.0
1  0.8  2.5  3.0
2  0.1  2.5  3.0
3  1.5  2.5  3.0
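Note that KNNImputer averages a fixed number of neighbors, which matches the cluster mean here only because B has exactly two identical neighbors. If you want the literal approach from the question (cluster first, then fill with the cluster mean), a minimal sketch could look like the following. It assumes the clustering columns (V2 and V3 here) have no missing values, and the choice of n_clusters=2, n_init=10 and random_state=0 is only for this toy example:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({'V1': [8.7, np.nan, 0.1, 1.5],
                   'V2': [4.3, 2.5, 2.5, 2.5],
                   'V3': [5, 3, 3, 3]},
                  index=list('ABCD'))

# Cluster only on fully observed columns, since KMeans rejects NaN input.
features = ['V2', 'V3']  # assumption: these columns are complete
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = pd.Series(km.fit_predict(df[features]), index=df.index)

# Per-cluster column means (NaN is skipped), broadcast back to every row.
cluster_means = df.groupby(labels).transform('mean')

# Replace each NaN with the mean of the row's own cluster.
filled = df.fillna(cluster_means)
print(filled)
```

A value stays NaN if every row in its cluster is missing that column, so on real data you may still need a fallback such as the overall column mean.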