Tags: python, pandas, scikit-learn, label-encoding

Label Encoder and Inverse_Transform on SOME Columns


Suppose I have a dataframe like the following:

import pandas as pd

df = pd.DataFrame({'animal':  ['Dog',   'Bird',  'Dog',   'Cat'],
                   'color':   ['Black', 'Blue',  'Brown', 'Black'],
                   'age':     [1,        10,       3,      6],
                   'pet':     [1,         0,       1,      1],
                   'sex':     ['m',      'm',     'f',    'f'],
                   'name':    ['Rex',    'Gizmo', 'Suzy', 'Boo']})

I want to use label encoder to encode "animal", "color", "sex" and "name", but I don't need to encode the other two columns. I also want to be able to inverse_transform the columns afterwards.

I have tried the following, and although encoding works as I'd expect it to, reversing does not.

from sklearn.preprocessing import LabelEncoder

to_encode = ["animal", "color", "sex", "name"]
le = LabelEncoder()
for col in to_encode:
    df[col] = le.fit_transform(df[col])


## to inverse:
for col in to_encode:
    df[col] = le.inverse_transform(df[col])

The inverse_transform function results in the following dataframe:

animal  color  age  pet    sex   name
   Rex    Boo    1    1  Gizmo    Rex
   Boo  Gizmo   10    0  Gizmo  Gizmo
   Rex    Rex    3    1    Boo   Suzy
 Gizmo    Boo    6    1    Boo    Boo

It's obviously not right, but I'm not sure how else I'd accomplish this?

Any advice would be appreciated!


Solution

  • As your output shows, inverse_transform is using only the information the encoder learned from the last column, "name": every encoded column now contains names. Each call to fit_transform refits the same LabelEncoder and overwrites what it learned from the previous column, so you need one LabelEncoder() per column.
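To make the overwriting visible, here is a minimal sketch using the animal and name values from the question. The variable names (animal_classes, name_classes, decoded) are only illustrative.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Each call to fit() replaces the classes learned by the previous call.
le.fit(["Dog", "Bird", "Dog", "Cat"])
animal_classes = [str(c) for c in le.classes_]   # sorted animal labels

le.fit(["Rex", "Gizmo", "Suzy", "Boo"])
name_classes = [str(c) for c in le.classes_]     # animal labels are gone now

# "Dog" was encoded as 2, but after the second fit code 2 decodes to a name.
decoded = str(le.inverse_transform([2])[0])
print(animal_classes, name_classes, decoded)
```

This is why the "animal" column in the question comes back as names: the single encoder only remembers the last column it was fitted on.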

    The key is to keep one fitted LabelEncoder per column. I recommend storing them in a dictionary keyed by column name:

    from sklearn import preprocessing

    to_encode = ["animal", "color", "sex", "name"]
    d = {}
    for col in to_encode:
        # One encoder per column, stored under the column name. Note that we only fit here.
        d[col] = preprocessing.LabelEncoder().fit(df[col])
    

    If we print the dictionary now, we will obtain something like this:

    {'animal': LabelEncoder(),
     'color': LabelEncoder(),
     'sex': LabelEncoder(),
     'name': LabelEncoder()}
    

    As we can see, each column we want to transform has its own fitted LabelEncoder(). For example, the encoder for "animal" has learned that 0 means Bird, 1 means Cat and 2 means Dog, and the other encoders hold the equivalent mapping for their columns.
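If you want to check the mapping an encoder has learned, its classes_ attribute holds the sorted labels, and each label's position is its integer code. A small sketch for the "animal" column (the names encoder and mapping are illustrative, not part of the answer above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

animal = pd.Series(["Dog", "Bird", "Dog", "Cat"])
encoder = LabelEncoder().fit(animal)

# classes_ is sorted; the index of each label is the code transform() assigns.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Bird': 0, 'Cat': 1, 'Dog': 2}
```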

    Once every column has a fitted encoder, we can transform and, later, inverse_transform. The only thing to watch is that each transform/inverse_transform call must use the LabelEncoder that belongs to that column.

    Here we transform:

    for col in to_encode:
        df[col] = d[col].transform(df[col])  # use this column's own encoder from the dictionary
    
    df
    
       animal  color  age  pet  sex  name
    0       2      0    1    1    1     2
    1       0      1   10    0    1     1
    2       2      2    3    1    0     3
    3       1      0    6    1    0     0
    

    And, once the df is transformed, we can inverse_transform:

    for col in to_encode:
        df[col] = d[col].inverse_transform(df[col])
    
    df
    
      animal  color  age  pet sex   name
    0    Dog  Black    1    1   m    Rex
    1   Bird   Blue   10    0   m  Gizmo
    2    Dog  Brown    3    1   f   Suzy
    3    Cat  Black    6    1   f    Boo
    

    Another idea would be to use ColumnTransformer, but unfortunately it doesn't support inverse_transform().
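A possible alternative, not part of the approach above: scikit-learn's OrdinalEncoder handles several feature columns at once and does provide inverse_transform, whereas LabelEncoder is documented as intended for target labels. The sketch below assumes the dataframe from the question; the names enc, encoded and restored are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color':  ['Black', 'Blue', 'Brown', 'Black'],
                   'age':    [1, 10, 3, 6],
                   'pet':    [1, 0, 1, 1],
                   'sex':    ['m', 'm', 'f', 'f'],
                   'name':   ['Rex', 'Gizmo', 'Suzy', 'Boo']})
to_encode = ["animal", "color", "sex", "name"]

# A single OrdinalEncoder keeps a separate category list for every column.
enc = OrdinalEncoder()
encoded = pd.DataFrame(enc.fit_transform(df[to_encode]).astype(int),
                       columns=to_encode, index=df.index)

# inverse_transform maps each column's codes back through its own categories.
restored = pd.DataFrame(enc.inverse_transform(encoded),
                        columns=to_encode, index=df.index)
print(encoded)
print(restored)
```

The "age" and "pet" columns are simply left out of to_encode, so they are never touched.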