I am dealing with a small challenge. I am trying to build a multiclass logistic regression model. Some of my variables are categorical, so I am trying to encode them.
My initial dataset looks like that:
The column I want to predict is action1_preflop, which contains 3 possible classes: "r", "c", "f".
When encoding categorical features, I end up losing the variable I want to predict as it gets converted into 3 sub-variables:
action1_preflop_r
action1_preflop_f
action1_preflop_c
Below is the new dataframe after encoding
       tiers  tiers2_theory  ...  action1_preflop_f  action1_preflop_r
0          7             11  ...                  1                  0
1          1              7  ...                  0                  1
2          5             11  ...                  1                  0
3          1             11  ...                  0                  1
4          1              7  ...                  0                  1
...      ...            ...  ...                ...                ...
31007      4             11  ...                  0                  1
31008      1             11  ...                  0                  1
31009      1             11  ...                  0                  1
31010      1             11  ...                  0                  1
31011      2              7  ...                  0                  1
[31012 rows x 11 columns]
How should I handle these new variables, given that the original variable, before encoding, is the one I want to predict?
Thanks for the help
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
df_raw = pd.read_csv('\\Users\\rapha\\Desktop\\Consulting\\Poker\\Tables test\\SB_preflop_a1_prob V1.csv', sep=";")
#Select categorical features only & use binary encoding
feature_cols = ['tiers','tiers2_theory','tiers3_theory','assorties','score','proba_preflop','action1_preflop']
df_raw = df_raw[feature_cols]
cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude = [object])
df_outcome = df.action1_preflop
df_variables = df.drop('action1_preflop',axis=1)
x = df_variables
y = df.action1_preflop
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
predict_test=lm.predict(x_test)
print(lm.score(x_test, y_test))
You should leave 'action1_preflop' out of the 'cat_features' dataframe and include it in the 'num_features' dataframe:
cat_features = df_raw.select_dtypes(include=[object])
cat_features = cat_features.drop(['action1_preflop'], axis=1)
num_features = df_raw.select_dtypes(exclude=[object])
num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)
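A variation on the same idea, in case it helps: split the target off before encoding at all, so that get_dummies never sees it. This is only a minimal sketch under some assumptions. The toy dataframe below is made up and stands in for the real CSV, treating 'assorties' as a categorical feature is a guess, and the names target_col and X_raw are mine. It also sidesteps the select_dtypes(exclude=[object]) line in the question, which would drop a string-valued target again if it were joined back into df.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the poker dataset (values are made up).
df_raw = pd.DataFrame({
    "tiers": [7, 1, 5, 1, 1, 4, 1, 2, 3, 6, 2, 5],
    "assorties": ["s", "o", "s", "o", "o", "s", "o", "s", "o", "s", "o", "s"],
    "action1_preflop": ["f", "r", "f", "r", "r", "c", "r", "c", "r", "f", "c", "f"],
})

target_col = "action1_preflop"

# Separate the target BEFORE encoding, so it stays a single column of labels.
y = df_raw[target_col]
X_raw = df_raw.drop(columns=[target_col])

# One-hot encode only the remaining categorical feature columns.
# dtype=int keeps the dummy columns as 0/1 integers.
X = pd.get_dummies(X_raw, dtype=int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# scikit-learn classifiers accept string class labels directly,
# so "r", "c", "f" do not need to be encoded by hand.
lm = LogisticRegression(max_iter=1000)
lm.fit(x_train, y_train)
predictions = lm.predict(x_test)
print(lm.score(x_test, y_test))
```

With this layout the predictions come back as the original labels ("r", "c" or "f"), so nothing has to be decoded afterwards.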