I'm at a beginner-to-intermediate level in data science. I want to impute missing values in a dataframe using KNN.
As the dataframe contains strings and floats, I need to encode / decode values using LabelEncoder.
My method is as follows:
- Replace NaN with a placeholder so the columns can be encoded
- Encode the text values and store the encoders in a dictionary
- Convert the placeholder back to NaN so it can be imputed with KNN
- Impute the missing values with KNN
- Decode the values using the dictionary
Unfortunately, in the last step, imputing values adds new values that cannot be decoded (unseen labels error message).
Could you please explain what I am doing wrong and, ideally, help me correct it? I know that there are other tools like OneHotEncoder, but I don't know them well enough, and I found LabelEncoder much more intuitive because you can see it directly in the dataframe (where LabelEncoder provides an array).
Please find below an example of my method. Thank you very much for your help:
# Import libraries.
import pandas as pd
import numpy as np
# Initialise data as a dict of lists.
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]}
# Make a DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
Output :
       Name   Age Car color  Height
0      Jack  59.0      Blue   177.0
1       NaN   NaN     Black   150.0
2  Victoria  29.0       NaN     NaN
3   Nicolas   NaN     Black   180.0
4    Victor  65.0      Grey   175.0
5      Brad  50.0       NaN   190.0
# LabelEncoder does not work with NaN values, so I replace them with value '1000' :
df = df.replace(np.nan, 1000)
# And to avoid errors, str columns must be set as strings (even '1000' value) :
df[['Name','Car color']] = df[['Name','Car color']].astype(str)
       Name     Age Car color  Height
0      Jack    59.0      Blue   177.0
1      1000  1000.0     Black   150.0
2  Victoria    29.0      1000  1000.0
3   Nicolas  1000.0     Black   180.0
4    Victor    65.0      Grey   175.0
5      Brad    50.0      1000   190.0
# Import LabelEncoder library :
from sklearn.preprocessing import LabelEncoder
# define labelencoder :
le = LabelEncoder()
# Import defaultdict library to make a dict of labelencoder :
from collections import defaultdict
# Initiate a dict of LabelEncoder values :
encoder_dict = defaultdict(LabelEncoder)
# Make a new dataframe of LabelEncoder values :
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))
# Show output :
   Name     Age  Car color  Height
0     2    59.0          2   177.0
1     0  1000.0          1   150.0
2     5    29.0          0  1000.0
3     3  1000.0          1   180.0
4     4    65.0          3   175.0
5     1    50.0          0   190.0
# Convert 1000 back to missing values so they can be imputed:
df = df.replace(1000, np.nan)
   Name   Age  Car color  Height
0     2  59.0          2   177.0
1     0   NaN          1   150.0
2     5  29.0          0     NaN
3     3   NaN          1   180.0
4     4  65.0          3   175.0
5     1  50.0          0   190.0
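Note that in the output above, the rows where Name or Car color was originally missing still hold the encoded placeholder (label 0) rather than NaN: replace(1000, np.nan) only matches the numeric value 1000, not the label that the string '1000' was encoded to. A minimal sketch of one possible way around this, assuming it is acceptable to fit each encoder on the observed values only (the toy frame and the names encoders, encoded, mask and codes are illustrative and not part of the original code):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (illustrative only), with missing entries in both text columns.
df = pd.DataFrame({
    "Name": ["Jack", np.nan, "Victoria"],
    "Car color": ["Blue", "Black", np.nan],
})

encoders = {}                          # one fitted LabelEncoder per column
encoded = pd.DataFrame(index=df.index)

for col in df.columns:
    mask = df[col].notna()             # True where a value is observed
    encoders[col] = LabelEncoder()
    codes = pd.Series(np.nan, index=df.index, dtype=float)
    # Fit and transform the observed values only; missing rows stay NaN.
    codes[mask] = encoders[col].fit_transform(df.loc[mask, col].to_numpy())
    encoded[col] = codes

print(encoded)
```

With this layout no placeholder is needed, and the missing text entries remain NaN so an imputer can see them.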
# Import KNNImputer to impute missing values:
from sklearn.impute import KNNImputer
# Define imputer :
imputer = KNNImputer(n_neighbors=2)
# Impute and reassign the column names:
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
   Name   Age  Car color  Height
0   2.0  59.0        2.0   177.0
1   0.0  47.0        1.0   150.0
2   5.0  29.0        0.0   165.0
3   3.0  44.0        1.0   180.0
4   4.0  65.0        3.0   175.0
5   1.0  50.0        0.0   190.0
# Decode data :
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
# Apply it to df -> THIS IS WHERE ERROR OCCURS :
df[['Name','Car color']].apply(inverse_transform_lambda)
Error message :
IndexError Traceback (most recent call last)
<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6930 def applymap(self, func):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
--> 186 return self.apply_standard()
188 def apply_empty_result(self):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
294 # wrap results
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
297 "y contains previously unseen labels: %s" % str(diff))
298 y = np.asarray(y)
--> 299 return self.classes_[y]
301 def _more_tags(self):
IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
Based on my comment, you should cast the values to integers before decoding, since KNNImputer returns floats:
# Decode data :
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int)) # or x[].astype(int)
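To illustrate the cast in isolation, here is a small self-contained sketch. It assumes a toy frame and simulates the float output of KNNImputer with astype(float) rather than running the imputer; the names encoders, encoded, as_float and decoded are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (illustrative only).
df = pd.DataFrame({
    "Name": ["Jack", "Victoria", "Brad"],
    "Car color": ["Blue", "Black", "Grey"],
})

# One LabelEncoder per column, fitted on that column's values.
encoders = {col: LabelEncoder() for col in df.columns}
encoded = pd.DataFrame(
    {col: encoders[col].fit_transform(df[col].to_numpy()) for col in df.columns}
)

# KNNImputer returns floats; simulate that here without running the imputer.
as_float = encoded.astype(float)

# Cast back to int so the values can index into each encoder's classes_.
decoded = pd.DataFrame({
    col: encoders[col].inverse_transform(as_float[col].astype(int).to_numpy())
    for col in df.columns
})

print(decoded)
```

In the question's code the imputed values have already been passed through np.round, so the integer cast does not truncate anything.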