python pandas scikit-learn label-encoding

Label Encoder and Inverse_Transform on SOME Columns

Suppose I have a dataframe like the following:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'animal':  ['Dog',   'Bird',  'Dog',   'Cat'],
                   'color':   ['Black', 'Blue',  'Brown', 'Black'],
                   'age':     [1,        10,       3,      6],
                   'pet':     [1,         0,       1,      1],
                   'sex':     ['m',      'm',     'f',    'f'],
                   'name':    ['Rex',    'Gizmo', 'Suzy', 'Boo']})

I want to use LabelEncoder to encode "animal", "color", "sex" and "name", but I don't need to encode the other two columns. I also want to be able to inverse_transform the encoded columns afterwards.

I have tried the following, and although encoding works as I'd expect it to, reversing does not.

to_encode = ["animal", "color", "sex", "name"]
le = LabelEncoder()
for col in to_encode:
    df[col] = le.fit_transform(df[col])

## to inverse:
for col in to_encode:
    df[col] = le.inverse_transform(df[col])

The inverse_transform function results in the following dataframe:

animal	color	age	pet	sex	name
Rex	Boo	1	1	Gizmo	Rex
Boo	Gizmo	10	0	Gizmo	Gizmo
Rex	Rex	3	1	Boo	Suzy
Gizmo	Boo	6	1	Boo	Boo

It's obviously not right, but I'm not sure how else I'd accomplish this?

Any advice would be appreciated!

Solution

As you can see in your output, inverse_transform is only using the information the encoder obtained from the last column it was fitted on, "name". That is why every encoded column now contains names: each call to fit_transform refits the same LabelEncoder and overwrites whatever it learned from the previous column.
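To see this directly, here is a minimal sketch on a reduced two-column version of the dataframe (the reduction is only for brevity). After the loop, the shared encoder remembers nothing but the classes of the last column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'name':   ['Rex', 'Gizmo', 'Suzy', 'Boo']})

le = LabelEncoder()
for col in ['animal', 'name']:
    # Each fit_transform call refits `le` and discards the previous classes.
    df[col] = le.fit_transform(df[col])

# Only the classes of the last fitted column ("name") remain.
remaining = list(le.classes_)
print(remaining)
```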

The key here is to fit a separate LabelEncoder for each column. To do this, I recommend storing them in a dictionary keyed by column name:

from sklearn import preprocessing

to_encode = ["animal", "color", "sex", "name"]
d = {}
for col in to_encode:
    # One fitted encoder per column. Note that we only fit here; nothing is transformed yet.
    d[col] = preprocessing.LabelEncoder().fit(df[col])

If we print the dictionary now, we will obtain something like this:

{'animal': LabelEncoder(),
 'color': LabelEncoder(),
 'sex': LabelEncoder(),
 'name': LabelEncoder()}

As we can see, each column we want to transform now has its own fitted LabelEncoder. This means, for example, that the "animal" encoder remembers that 0 stands for Bird, 1 for Cat and 2 for Dog (labels are numbered in sorted order), and likewise for every other column.
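The stored mapping can be inspected through the fitted encoder's classes_ attribute. A small sketch, using only the "animal" values from the question (the names animals, enc and mapping are just for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

animals = pd.Series(['Dog', 'Bird', 'Dog', 'Cat'])
enc = LabelEncoder().fit(animals)

# classes_ holds the sorted unique labels; the position of each label is its code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)
```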

Once every column has been fitted, we can transform and, later on, inverse_transform. The only thing to be aware of is that every transform/inverse_transform call has to use the LabelEncoder that belongs to that column.

Here we transform:

for col in to_encode:
    df[col] = d[col].transform(df[col])  # look up this column's own encoder in the dictionary

df

   animal  color  age  pet  sex  name
0       2      0    1    1    1     2
1       0      1   10    0    1     1
2       2      2    3    1    0     3
3       1      0    6    1    0     0

And, once the df is transformed, we can inverse_transform:

for col in to_encode:
    df[col] = d[col].inverse_transform(df[col])

df

  animal  color  age  pet sex   name
0    Dog  Black    1    1   m    Rex
1   Bird   Blue   10    0   m  Gizmo
2    Dog  Brown    3    1   f   Suzy
3    Cat  Black    6    1   f    Boo
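If you do this often, the steps above could be wrapped in two small helpers. This is only a sketch: fit_encoders and apply_encoders are hypothetical names, not part of scikit-learn, and the helpers work on a copy so the original dataframe is left untouched.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def fit_encoders(frame, columns):
    """Fit one LabelEncoder per column and return them in a dict."""
    return {col: LabelEncoder().fit(frame[col]) for col in columns}

def apply_encoders(frame, encoders, inverse=False):
    """Return a copy with each listed column (inverse-)transformed by its own encoder."""
    out = frame.copy()
    for col, enc in encoders.items():
        out[col] = enc.inverse_transform(out[col]) if inverse else enc.transform(out[col])
    return out

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color':  ['Black', 'Blue', 'Brown', 'Black'],
                   'age':    [1, 10, 3, 6],
                   'pet':    [1, 0, 1, 1],
                   'sex':    ['m', 'm', 'f', 'f'],
                   'name':   ['Rex', 'Gizmo', 'Suzy', 'Boo']})
to_encode = ["animal", "color", "sex", "name"]

encoders = fit_encoders(df, to_encode)
encoded = apply_encoders(df, encoders)
restored = apply_encoders(encoded, encoders, inverse=True)

# The round trip should give back the original labels in every encoded column.
round_trip_ok = all(restored[col].tolist() == df[col].tolist() for col in to_encode)
print(round_trip_ok)
```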

One interesting idea could be to use ColumnTransformer, but unfortunately it doesn't support inverse_transform().
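A different option, not used in the answer above, is scikit-learn's OrdinalEncoder. It is fitted on several columns at once, keeps one set of categories per column, and does provide inverse_transform. The sketch below assumes a reduced version of the dataframe; note that the codes come back as floats by default.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color':  ['Black', 'Blue', 'Brown', 'Black'],
                   'age':    [1, 10, 3, 6]})
to_encode = ['animal', 'color']

oe = OrdinalEncoder()
codes = oe.fit_transform(df[to_encode])   # 2D float array, one column of codes per input column
restored = oe.inverse_transform(codes)    # 2D object array holding the original labels

encoded = pd.DataFrame(codes, columns=to_encode, index=df.index)
print(encoded)
print(restored)
```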