python scikit-learn imputation label-encoding

LabelEncoder cannot inverse_transform (unseen labels) after imputing missing values


I'm at a beginner-to-intermediate level in data science. I want to impute the missing values in a dataframe using KNN.

As the dataframe contains strings and floats, I need to encode / decode values using LabelEncoder.

My method is as follows:

  1. Replace NaN with a placeholder so that the columns can be encoded
  2. Encode the text values, keeping one encoder per column in a dictionary
  3. Turn the placeholder back into NaN so that it can be imputed with KNN
  4. Impute the missing values with KNN
  5. Decode the values using the encoders stored in the dictionary

Unfortunately, the last step fails: after imputation, the columns contain values that the encoders cannot decode (see the error message at the end).

Could you please explain what I am doing wrong and, ideally, help me correct it? I know that there are other tools like OneHotEncoder, but I don't know them well enough, and I find LabelEncoder much more intuitive because the result is visible directly in the dataframe (whereas OneHotEncoder returns an array).

Below is an example of my method. Thank you very much for your help:

[1]

# Import libraries. 
import pandas as pd 
import numpy as np

# Initialise the data as a dict of lists. 
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]} 

# Make a DataFrame 
df = pd.DataFrame(data) 

# Print the output. 
df 

Output : 
    Name    Age     Car color   Height
0   Jack    59.0    Blue    177.0
1   NaN     NaN     Black   150.0
2   Victoria    29.0    NaN     NaN
3   Nicolas     NaN     Black   180.0
4   Victor  65.0    Grey    175.0
5   Brad    50.0    NaN     190.0

[2]

# LabelEncoder does not work with NaN values, so I replace them with value '1000' : 
df = df.replace(np.nan, 1000)

# And to avoid errors, str columns must be set as strings (even '1000' value) : 
df[['Name','Car color']] = df[['Name','Car color']].astype(str)

df

Output 
    Name    Age     Car color   Height
0   Jack    59.0    Blue    177.0
1   1000    1000.0  Black   150.0
2   Victoria    29.0    1000    1000.0
3   Nicolas     1000.0  Black   180.0
4   Victor  65.0    Grey    175.0
5   Brad    50.0    1000    190.0

[3]

# Import LabelEncoder library : 
from sklearn.preprocessing import LabelEncoder

# define labelencoder :
le = LabelEncoder()

# Import defaultdict library to make a dict of labelencoder :
from collections import defaultdict

# Initiate a dict of LabelEncoder values :
encoder_dict = defaultdict(LabelEncoder)

# Encode the text columns in place, with one LabelEncoder per column stored in encoder_dict :
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))

# Show output :
df

Output 
    Name    Age     Car color   Height
0   2   59.0    2   177.0
1   0   1000.0  1   150.0
2   5   29.0    0   1000.0
3   3   1000.0  1   180.0
4   4   65.0    3   175.0
5   1   50.0    0   190.0

[4]

# Turn the 1000 placeholder back into missing values so that they can be imputed : 
df = df.replace(1000, np.nan)
df

Output 

    Name    Age     Car color   Height
0   2   59.0    2   177.0
1   0   NaN     1   150.0
2   5   29.0    0   NaN
3   3   NaN     1   180.0
4   4   65.0    3   175.0
5   1   50.0    0   190.0

[5]

# Import the KNN imputer to fill in the missing values : 
from sklearn.impute import KNNImputer

# Define imputer : 
imputer = KNNImputer(n_neighbors=2)

# Impute, round, and restore the column names : 
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
df

Output 

    Name    Age     Car color   Height
0   2.0     59.0    2.0     177.0
1   0.0     47.0    1.0     150.0
2   5.0     29.0    0.0     165.0
3   3.0     44.0    1.0     180.0
4   4.0     65.0    3.0     175.0
5   1.0     50.0    0.0     190.0

[6]

# Decode data : 
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)

# Apply it to df -> THIS IS WHERE ERROR OCCURS :
df[['Name','Car color']].apply(inverse_transform_lambda)

Error message :

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)

5 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6926             kwds=kwds,
   6927         )
-> 6928         return op.get_result()
   6929 
   6930     def applymap(self, func):

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
    184             return self.apply_raw()
    185 
--> 186         return self.apply_standard()
    187 
    188     def apply_empty_result(self):

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
    290 
    291         # compute the result using the series generator
--> 292         self.apply_series_generator()
    293 
    294         # wrap results

/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
    319             try:
    320                 for i, v in enumerate(series_gen):
--> 321                     results[i] = self.f(v)
    322                     keys.append(v.name)
    323             except Exception as e:

<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)

/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
    297                     "y contains previously unseen labels: %s" % str(diff))
    298         y = np.asarray(y)
--> 299         return self.classes_[y]
    300 
    301     def _more_tags(self):

IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
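
The last frame of the traceback shows `self.classes_[y]`, meaning the column values are used as indices into a NumPy array. The following minimal sketch illustrates why that fails after step [5]; the small arrays `X` and `names` are made up for illustration and are not taken from the dataframe above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: column 0 holds label codes, column 1 has a gap.
X = np.array([[0.0, 59.0], [1.0, np.nan], [2.0, 29.0], [3.0, 65.0]])

# np.round changes the values, not the dtype: the result is still float64.
imputed = np.round(KNNImputer(n_neighbors=2).fit_transform(X))
print(imputed.dtype)

# Stand-in for LabelEncoder.classes_ (a sorted array of labels).
names = np.array(["Brad", "Jack", "Nicolas", "Victor"])

# Indexing with a float array is rejected by NumPy.
raised = False
try:
    names[imputed[:, 0]]
except IndexError as exc:
    raised = True
    print(exc)

# Indexing with the same values cast to int works.
print(names[imputed[:, 0].astype(int)])
```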

Solution

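  • KNNImputer returns floats (np.round does not change the dtype), while inverse_transform uses the values as integer indices into classes_, hence the IndexError. Cast each column to int before decoding:

    # Decode data : 
    inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int))  # cast the imputed floats to int first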
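
One further point is visible in the question's outputs. The string placeholder '1000' in Name and Car color is encoded as class 0 in step [3]. `df.replace(1000, np.nan)` in step [4] only touches the numeric columns, so those cells are never imputed and would decode back to '1000'. Below is a sketch of one possible way around this, assuming it is acceptable to fit each encoder on the non-missing values only. The names `cat_cols`, `encoders`, `encoded`, `imputed` and `decoded` are illustrative, and the 'Height' key is written without the trailing space.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Name": ["Jack", np.nan, "Victoria", "Nicolas", "Victor", "Brad"],
    "Age": [59, np.nan, 29, np.nan, 65, 50],
    "Car color": ["Blue", "Black", np.nan, "Black", "Grey", np.nan],
    "Height": [177, 150, np.nan, 180, 175, 190],
})
cat_cols = ["Name", "Car color"]

# Encode only the non-missing cells, so NaN stays NaN and no placeholder
# class ends up inside the encoder.
encoders = {}
encoded = df.copy()
for col in cat_cols:
    mask = df[col].notna()
    enc = LabelEncoder()
    codes = pd.Series(np.nan, index=df.index, dtype=float)
    codes.loc[mask] = enc.fit_transform(df.loc[mask, col])
    encoded[col] = codes
    encoders[col] = enc

# Impute and round; the result is still float64.
imputed = pd.DataFrame(
    np.round(KNNImputer(n_neighbors=2).fit_transform(encoded)),
    columns=encoded.columns,
    index=encoded.index,
)

# Cast the code columns to int before decoding.
decoded = imputed.copy()
for col in cat_cols:
    decoded[col] = encoders[col].inverse_transform(imputed[col].astype(int))

print(decoded)
```

Keep in mind that LabelEncoder assigns codes in sorted label order, so averaging the codes of the nearest neighbours and rounding gives a valid label, but not necessarily a meaningful one. For a column like Name, the imputed value is essentially arbitrary.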