python, pandas, encoding, sklearn-pandas

How to reverse the encoding of sklearn LabelEncoder() after training the model?


I am building a machine learning model in Python that predicts the outcome of a football match. Below is the code used to train the model:

features = ['Home Team',..., 'home_team_avg_Sh_last_3', 'Away Team',..., 'away_team_avg_Sh_last_3']
label = ['Match Result']
df_allteammerged[features + label]
Home Team ... Away Team ... Match Result
Arsenal ... Fulham ... Home Win
... ... ... ...
Brentford ... Everton ... Draw
encode = ['Home Team', 'Away Team', 'Match Result']

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
for e in encode:
  df_allteammerged[e] = enc.fit_transform(df_allteammerged[e])
df_allteammerged[features + label]
Home Team ... Away Team ... Match Result
0 ... 8 ... 1
... ... ... ...
3 ... 7 ... 2
X = df_allteammerged[features].values
y = df_allteammerged[label].values.flatten()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

from xgboost import XGBClassifier

model = XGBClassifier(objective="multi:softmax")
model.fit(X_train,y_train)
xgb_pred = model.predict(X_test)

Once the model is trained, I create a DataFrame that holds the actual result (Match Result), the predicted result, the home team and the away team for the test data:

xtesthome = [i[0] for i in X_test]
xtestaway = [i[9] for i in X_test]
df_pred_compare = pd.DataFrame({"Actual Result": y_test, "Predicted Result": xgb_pred, "Home Team": xtesthome,"Away Team": xtestaway})
df_pred_compare

This DataFrame is then saved to a CSV file.

The main problem is that I want to reverse the encoding on Home Team and Away Team, so that the dataframe/CSV file contains the original team names rather than the numbers.

I tried following the solution from the post "Python - How to reverse the encoding of data encoded with LabelEncoder after it has been split by train_test_split?". This included removing the .values from X and y so that they are dataframes rather than arrays, but the returned dataframe still did not have the encoding of Home Team and Away Team reversed.

Any help would be appreciated.


Solution

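  • A single LabelEncoder that is refit inside the loop only remembers the classes of the last column it was fitted on ('Match Result'), so it cannot decode the team columns. I would create a separate encoder per column and store them in a dict so that the encoding can be reversed easily: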

    encode = ['Home Team', 'Away Team', 'Match Result']
        
    from sklearn.preprocessing import LabelEncoder
    enc_dict = {}
    for e in encode:
        enc_dict[e] = LabelEncoder()
        df_allteammerged[e] = enc_dict[e].fit_transform(df_allteammerged[e])
    df_allteammerged[features + label]
    

    Then, after all the modelling, you can reverse the encoding with this code:

    # int(...) guards against X_test being a float array, which cannot be used to index the classes
    xtesthome = enc_dict['Home Team'].inverse_transform([int(i[0]) for i in X_test])
    xtestaway = enc_dict['Away Team'].inverse_transform([int(i[9]) for i in X_test])
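
To see the whole round trip in one place, here is a minimal sketch on a toy DataFrame. The data, the `home_avg_shots` column and the variable names `df`, `shared`, `home_idx` and `away_idx` are made up for illustration and are not from the question. It first shows that a shared encoder keeps only the last column's classes, then decodes the team columns from a float feature array with per-column encoders. The column positions are looked up with `features.index(...)` rather than hard-coded, which is an optional design choice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df_allteammerged (hypothetical values).
df = pd.DataFrame({
    "Home Team": ["Arsenal", "Brentford", "Chelsea", "Arsenal"],
    "home_avg_shots": [12.3, 9.0, 14.7, 11.0],
    "Away Team": ["Fulham", "Everton", "Arsenal", "Chelsea"],
    "Match Result": ["Home Win", "Draw", "Away Win", "Home Win"],
})
original = df.copy()
encode = ["Home Team", "Away Team", "Match Result"]

# A single shared encoder: every fit_transform call overwrites classes_,
# so after the loop only the 'Match Result' labels are left.
shared = LabelEncoder()
for col in encode:
    shared.fit_transform(df[col])
shared_classes = list(shared.classes_)

# One encoder per column, kept in a dict so each column can be decoded later.
enc_dict = {}
for col in encode:
    enc_dict[col] = LabelEncoder()
    df[col] = enc_dict[col].fit_transform(df[col])

features = ["Home Team", "home_avg_shots", "Away Team"]
X = df[features].values  # float64, because one feature column is float

# Look up the column positions instead of hard-coding them.
home_idx = features.index("Home Team")
away_idx = features.index("Away Team")

# Cast the codes back to int before decoding them.
home_names = enc_dict["Home Team"].inverse_transform(X[:, home_idx].astype(int))
away_names = enc_dict["Away Team"].inverse_transform(X[:, away_idx].astype(int))
results = enc_dict["Match Result"].inverse_transform(df["Match Result"].values)

print(pd.DataFrame({"Home Team": home_names,
                    "Away Team": away_names,
                    "Match Result": results}))
```

The same `enc_dict['Match Result'].inverse_transform(...)` call can be applied to `y_test` and to the model's predictions if the CSV should contain result labels instead of codes. The predictions may also need an integer cast first, depending on the dtype the model returns.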