Tags: python, machine-learning, scikit-learn, logistic-regression

Why does over-sampling in a pipeline explode the number of model coefficients?


I have a model pipeline like this:

import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# define preprocessor
preprocess = make_column_transformer(
    (StandardScaler(), ['attr1', 'attr2', 'attr3', 'attr4', 'attr5', 
                        'attr6', 'attr7', 'attr8', 'attr9']),
    (OneHotEncoder(categories='auto'), ['attrcat1', 'attrcat2'])
)

# define train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0)

When I execute the pipeline without over-sampling I get:

# don't do over-sampling in this case
os_X_train = X_train
os_y_train = y_train

print('Training data is type %s and shape %s' % (type(os_X_train), os_X_train.shape))
logreg = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model = make_pipeline(preprocess, logreg)
model.fit(os_X_train, np.ravel(os_y_train))
print("The coefficients shape is: %s" % logreg.coef_.shape)
print("Model coefficients: ", logreg.intercept_, logreg.coef_)
print("Logistic Regression score: %f" % model.score(X_test, y_test))

The output is:

Training data is type <class 'pandas.core.frame.DataFrame'> and shape (87145, 11)
The coefficients shape is: (1, 47)
Model coefficients:  [-7.51822124] [[ 0.10011794  0.10313989 ... -0.14138371  0.01612046  0.12064405]]
Logistic Regression score: 0.999116

That is, I get 47 model coefficients for a training set of 87145 samples, which makes sense given the defined preprocessing: the OneHotEncoder works on attrcat1 and attrcat2, which have 31 and 7 categories respectively, adding 38 columns; together with the 9 numeric columns I already had, that makes 47 features.
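To double-check that count, the preprocessor can be fitted on its own and inspected (a minimal sketch; it relies on make_column_transformer's default, lower-cased transformer names):

# fit only the preprocessor and count the columns it produces
n_features = preprocess.fit_transform(X_train).shape[1]
print("Columns after preprocessing:", n_features)  # 47

# the one-hot part alone: one column per category of attrcat1 and attrcat2
ohe = preprocess.named_transformers_['onehotencoder']
print("One-hot columns:", sum(len(c) for c in ohe.categories_))  # 31 + 7 = 38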

Now, if I do the same but this time over-sample the training data with SMOTE like this:

from imblearn.over_sampling import SMOTE
# balance the classes by oversampling the training data
os = SMOTE(random_state=0)
os_X_train, os_y_train = os.fit_sample(X_train, y_train.ravel())
os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
os_y_train = pd.DataFrame(data=os_y_train, columns=['response'])

The output becomes:

Training data is type <class 'pandas.core.frame.DataFrame'> and shape (174146, 11)
The coefficients shape is: (1, 153024)
Model coefficients:  [12.02830778] [[ 0.42926969  0.14192505 -1.89354062 ...  0.008847    0.00884372 -8.15123962]]
Logistic Regression score: 0.997938

In this case I get about twice the training sample size, which balances the response classes as I wanted, but my logistic regression model explodes to 153024 coefficients. That doesn't make any sense ... any ideas why?


Solution

  • OK, I found the culprit. SMOTE converts all feature columns to float (including the two categorical ones) and synthesises new samples by interpolating between existing ones, so those columns end up with almost as many distinct float values as there are samples. When the column transformer then applies the OneHotEncoder to them, it creates one column per distinct value, and the feature count explodes towards the number of samples. (The dtype/unique-value check sketched after the code below makes this easy to confirm.)

    The solution was simply to convert those categorical columns back to int before running the pipeline:

    # balance the classes by over-sampling the training data
    os = SMOTE(random_state=0)
    os_X_train, os_y_train = os.fit_sample(X_train, y_train.ravel())
    os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
    # critically important to have the categorical variables from float back to int
    os_X_train['attrcat1'] = os_X_train['attrcat1'].astype(int)
    os_X_train['attrcat2'] = os_X_train['attrcat2'].astype(int)
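
    A quick way to confirm the diagnosis is to look at what SMOTE hands back before the astype(int) fix, i.e. right after the resampling step (a minimal sketch using the same fit_sample call as above; newer imbalanced-learn releases name it fit_resample):

    # re-run only the resampling step to inspect the raw SMOTE output
    raw_X, _ = SMOTE(random_state=0).fit_sample(X_train, y_train.ravel())
    raw_X = pd.DataFrame(data=raw_X, columns=X_train.columns)

    # both categorical columns come back as float64
    print(raw_X[['attrcat1', 'attrcat2']].dtypes)

    # distinct values the OneHotEncoder would see per column; because SMOTE
    # interpolates, this is close to the number of samples instead of 31 and 7
    print(raw_X['attrcat1'].nunique(), raw_X['attrcat2'].nunique())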