
Python - How to reverse the encoding of data encoded with LabelEncoder after it has been split by train_test_split?


I am trying to export an unencoded version of a dataset that was encoded with LabelEncoder (from sklearn.preprocessing, so that machine learning algorithms could be applied) and then split into training and test datasets with train_test_split.

I want to export the test data to Excel, but with the original values. The examples I have found so far apply the inverse_transform method of LabelEncoder to only one variable. I want to apply it automatically to all of the columns that were encoded in the first place.

Here is some example data:

# data
import pandas as pd

code = 'A B C D A B C D E F'.split()
sp = 'animal bird animal animal animal bird animal animal bird thing'.split()
res = 'yes, yes, yes, yes, no, no, yes, no, yes, no'.split(', ')

data = pd.DataFrame({'code': code, 'sp': sp, 'res': res})
data

Assume 'res' is the target variable and 'code' and 'sp' are the features.


Solution

  • Here you go:

    # data
    import pandas as pd

    code = 'A B C D A B C D E F'.split()
    sp = 'animal bird animal animal animal bird animal animal bird thing'.split()
    res = 'yes, yes, yes, yes, no, no, yes, no, yes, no'.split(', ')

    data = pd.DataFrame({'code': code, 'sp': sp, 'res': res})
    data
    


    # creating LabelEncoder object
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    
    # encoding
    dfe = pd.DataFrame()    # created empty dataframe for saving encoded values
    for column in data.columns:
        dfe[column] = le.fit_transform(data[column])
    dfe
    

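The loop above reuses a single `le`, so after it finishes the encoder only holds the mapping of the last column it was fitted on, which is why the reversal further down has to refit it. As a minimal sketch of one alternative, you could keep one fitted encoder per column in a dict and decode later without refitting. The names `encoders` and `decoded` are assumptions made for this illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "code": "A B C D A B C D E F".split(),
    "sp": "animal bird animal animal animal bird animal animal bird thing".split(),
    "res": "yes yes yes yes no no yes no yes no".split(),
})

# One fitted encoder per column, kept in a dict (hypothetical name `encoders`)
encoders = {}
dfe = pd.DataFrame(index=data.index)
for column in data.columns:
    encoders[column] = LabelEncoder()
    dfe[column] = encoders[column].fit_transform(data[column])

# Each encoder still remembers its own classes, so decoding needs no refit
decoded = pd.DataFrame(
    {column: encoders[column].inverse_transform(dfe[column]) for column in dfe.columns},
    index=dfe.index,
)
```

The same `encoders[column].inverse_transform(...)` call should work on any subset of rows, such as the split frames produced later.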

    # saving features
    X = dfe[['code','sp']]
    
    # saving target
    y = dfe['res']
    
    # splitting into training & test data
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
    
    X_train
    


    # reversal of encoding
    dfr_train = X_train.copy()
    for column in X.columns:
        # refit the encoder on the original (unencoded) column so that it
        # holds that column's mapping again
        le.fit(data[column])
        # with the mapping restored, decode the corresponding encoded
        # column of the split dataset
        dfr_train[column] = le.inverse_transform(X_train[column])
    dfr_train
    


    You can do the same for the test data.

    # reversal of encoding of data
    dfr_test = X_test.copy()
    for column in X.columns:
        le.fit(data[column])
        dfr_test[column] = le.inverse_transform(X_test[column])
    dfr_test
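
If refitting inside a loop feels fragile, a different technique (not the one used in this answer) is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and reverses them in a single `inverse_transform` call. The sketch below assumes the same example data; `features` and `oe` are names chosen here for illustration, and the target column is left out because `LabelEncoder` remains the usual choice for it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    "code": "A B C D A B C D E F".split(),
    "sp": "animal bird animal animal animal bird animal animal bird thing".split(),
    "res": "yes yes yes yes no no yes no yes no".split(),
})
features = ["code", "sp"]

# Encode all feature columns with one encoder object
oe = OrdinalEncoder()
X = pd.DataFrame(oe.fit_transform(data[features]), columns=features, index=data.index)

X_train, X_test = train_test_split(X, test_size=0.2, random_state=13)

# inverse_transform handles every feature column in a single call
dfr_test = pd.DataFrame(oe.inverse_transform(X_test), columns=features, index=X_test.index)
```

Note that `OrdinalEncoder` returns floats by default, so the encoded values look like `0.0, 1.0, ...` rather than integers.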
    

    Here is the full training data (features + target) for export:

    # reverse encoding of target variable y
    le.fit(data['res'])
    dfr_train['res'] = le.inverse_transform(y_train)
    dfr_train     # unencoded training data, ready for export
    

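Since the question asks for the test data specifically, it may be worth noting a shortcut that avoids decoding altogether: `train_test_split` keeps the row labels of a DataFrame, so the test rows can be looked up in the original, unencoded frame by index. The sketch below is written under that assumption; `test_original` and the file name `test_data.xlsx` are placeholders, and the Excel export is left commented out because it needs an engine such as openpyxl to be installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "code": "A B C D A B C D E F".split(),
    "sp": "animal bird animal animal animal bird animal animal bird thing".split(),
    "res": "yes yes yes yes no no yes no yes no".split(),
})

# Encode every column, as in the answer
dfe = pd.DataFrame(index=data.index)
for column in data.columns:
    dfe[column] = LabelEncoder().fit_transform(data[column])

X_train, X_test, y_train, y_test = train_test_split(
    dfe[["code", "sp"]], dfe["res"], test_size=0.2, random_state=13
)

# The split frames carry the original row labels, so the unencoded test rows
# (features and target together) can be selected directly from `data`
test_original = data.loc[X_test.index]

# Export; requires an Excel engine such as openpyxl
# test_original.to_excel("test_data.xlsx", index=False)
```

This only works while the original frame is still available and its index is unique; otherwise the `inverse_transform` route shown in the answer is the way to go.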