python · scikit-learn · categorical-data

Losing my target variable when encoding categorical variables


I am dealing with a little challenge. I am trying to create a multiclass logistic regression model. Some of my variables are categorical, so I'm trying to encode them.

My initial dataset looks like this:

[screenshot of the initial dataset]

The column I want to predict is action1_preflop; it contains 3 possible classes: "r", "c", "f".

When encoding the categorical features, I end up losing the variable I want to predict, as it gets converted into 3 sub-variables: action1_preflop_r, action1_preflop_f, action1_preflop_c.

Below is the new dataframe after encoding

       tiers  tiers2_theory  ...  action1_preflop_f  action1_preflop_r
0          7             11  ...                  1                  0
1          1              7  ...                  0                  1
2          5             11  ...                  1                  0
3          1             11  ...                  0                  1
4          1              7  ...                  0                  1
     ...            ...  ...                ...                ...
31007      4             11  ...                  0                  1
31008      1             11  ...                  0                  1
31009      1             11  ...                  0                  1
31010      1             11  ...                  0                  1
31011      2              7  ...                  0                  1

[31012 rows x 11 columns]
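To make the issue concrete, here is a tiny made-up example (the values are invented, only the column names come from my data) that reproduces the same behaviour:

import pandas as pd

# Made-up miniature of the real data, just for illustration
toy = pd.DataFrame({
    'tiers': [7, 1, 5],
    'action1_preflop': ['f', 'r', 'c'],   # the column I want to predict
})

# get_dummies encodes every object column, so the target is split into
# action1_preflop_c / action1_preflop_f / action1_preflop_r and the
# original column disappears
print(pd.get_dummies(toy))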

Could you please let me know how I am supposed to deal with these new variables, considering that the column that got encoded is the very variable I want to predict?

Thanks for the help

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model

df_raw = pd.read_csv('\\Users\\rapha\\Desktop\\Consulting\\Poker\\Tables test\\SB_preflop_a1_prob V1.csv', sep=";")

# Select the chosen feature columns and one-hot encode the categorical ones
feature_cols = ['tiers','tiers2_theory','tiers3_theory','assorties','score','proba_preflop','action1_preflop']
df_raw = df_raw[feature_cols]
cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude = [object])

df_outcome = df.action1_preflop
df_variables = df.drop('action1_preflop',axis=1)

x = df_variables
y = df.action1_preflop

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)

lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)

predict_test=lm.predict(x_test)
print(lm.score(x_test, y_test))

Solution

  • You should leave 'action1_preflop' out of the 'cat_features' dataframe and include it in the 'num_features' dataframe:

    cat_features = df_raw.select_dtypes(include=[object])
    cat_features = cat_features.drop(['action1_preflop'], axis=1)
    num_features = df_raw.select_dtypes(exclude=[object])
    num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)
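
  • With 'action1_preflop' kept as a plain column, the rest of the pipeline from the question can stay essentially as it was. A minimal sketch of how it might continue (this continuation is an illustration, not part of the original answer; the names and estimator settings are taken from the question):

    # Sketch: one-hot encode only the remaining categorical features and
    # keep the untouched target column for y
    df = num_features.join(pd.get_dummies(cat_features))

    x = df.drop('action1_preflop', axis=1)
    y = df['action1_preflop']          # still the raw 'r' / 'c' / 'f' labels

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
    lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
    lm.fit(x_train, y_train)
    print(lm.score(x_test, y_test))

  scikit-learn classifiers accept string class labels directly, so there is no need to encode "r"/"c"/"f" for the target at all.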