python scikit-learn logistic-regression grid-search gridsearchcv

Logistic Regression Model using Regularization (L1 / L2) Lasso and Ridge

I am trying to build a logistic regression model and tune it with a grid search; the code is below. The raw data is the credit card fraud dataset from Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud

The code below starts at standardization, after the data has been read into credit_card_fraud_df.

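# Imports needed by the code below
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import PowerTransformer, StandardScaler

standardization = StandardScaler()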
credit_card_fraud_df[['Amount']] = standardization.fit_transform(credit_card_fraud_df[['Amount']])
# Assigning feature variable to X
X = credit_card_fraud_df.drop(['Class'], axis=1)

# Assigning response variable to y
y = credit_card_fraud_df['Class']
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)
X_train.head()
power_transformer = PowerTransformer(copy=False)
power_transformer.fit(X_train)                       ## Fit the PT on training data
X_train_pt_df = power_transformer.transform(X_train)    ## Then apply on all data
X_test_pt_df = power_transformer.transform(X_test)
y_train_pt_df = y_train
y_test_pt_df = y_test
train_pt_df = pd.DataFrame(data=X_train_pt_df, columns=X_train.columns.tolist())
# set up cross validation scheme
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 4)

# specify range of hyperparameters
# 5 values of C from 1e-3 to 1e3 (the stray fourth argument only set endpoint=True)
params = {"C": np.logspace(-3, 3, 5), "penalty": ["l1", "l2"]}  # l1 = lasso, l2 = ridge

## using Logistic regression for class imbalance
model = LogisticRegression(class_weight='balanced')
grid_search_cv = GridSearchCV(estimator = model, param_grid = params, 
                        scoring= 'roc_auc', 
                        cv = folds, 
                        return_train_score=True, verbose = 1)            
grid_search_cv.fit(X_train_pt_df, y_train_pt_df)
## reviewing the results
cv_results = pd.DataFrame(grid_search_cv.cv_results_)
cv_results

Sample Result:

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_C  param_penalty  params                         split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
0  0.044332       0.002040      0.000000         0.000000        0.001    l1             {'C': 0.001, 'penalty': 'l1'}  NaN                NaN                NaN                NaN                NaN                NaN              NaN             6
1  0.477965       0.046651      0.016745         0.003813        0.001    l2             {'C': 0.001, 'penalty': 'l2'}  0.485714           0.428571           0.542857           0.485714           0.457143           0.480000         0.037904        5

I do not have any null values in the input data. I do not understand why I am getting NaN values in the score columns for the l1 penalty. Can anyone please help me?

Solution

The problem is the default solver, which is used because no solver is specified here:

model = LogisticRegression(class_weight='balanced')

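The default solver, lbfgs, does not support the l1 penalty, so every fit with penalty='l1' fails. GridSearchCV scores a failed fit with its error_score, which defaults to NaN, and that is where the NaN values come from. The underlying error message is: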

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

It also helps to check the documentation before defining a parameter grid:

penalty: {‘l1’, ‘l2’, ‘elasticnet’, ‘none’}, default=’l2’ Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not supported by the liblinear solver), no regularization is applied.
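Because not every solver supports every penalty, one option is to pass param_grid as a list of dictionaries, so that each solver is only paired with penalties it supports. The following is a minimal sketch under assumptions: the data is synthetic (make_classification), the solver choices (liblinear for l1 and l2, lbfgs for l2 only) are just one valid pairing, and max_iter=1000 is an arbitrary value chosen for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the fraud data (assumption, illustration only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
C_values = np.logspace(-3, 3, 5)

# Each dictionary is expanded on its own, so lbfgs is never combined with l1
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_values},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": C_values},
]

search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=folds,
)
search.fit(X_demo, y_demo)

# One line per candidate: solver, penalty, C and the mean cross-validated AUC
for candidate, score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(candidate["solver"], candidate["penalty"], candidate["C"], round(score, 4))
```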

As soon as you switch to a solver that supports every penalty in your grid, you're good to go. For example, saga supports both l1 and l2:

## using Logistic regression for class imbalance
model = LogisticRegression(class_weight='balanced', solver='saga')
grid_search_cv = GridSearchCV(estimator = model, param_grid = params, 
                        scoring= 'roc_auc', 
                        cv = folds, 
                        return_train_score=True, verbose = 1)            
grid_search_cv.fit(X_train_pt_df, y_train_pt_df)
## reviewing the results
cv_results = pd.DataFrame(grid_search_cv.cv_results_)

Note that you may also see a ConvergenceWarning. It suggests that you need to raise max_iter above its default, loosen tol, switch to another solver, or rethink the parameter grid.
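To see the effect of max_iter on that warning, here is a small sketch on synthetic, standardized data (not the fraud data). The helper fit_saga and the two max_iter values are hypothetical choices for the demo; the values that work for your data will differ.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized because saga is sensitive to feature scale
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)


def fit_saga(max_iter):
    """Fit an l1-penalized model with saga and report whether it warned."""
    model = LogisticRegression(
        penalty="l1", solver="saga", class_weight="balanced", max_iter=max_iter
    )
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X_demo, y_demo)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, warned


# A deliberately tiny iteration budget versus a generous one
model_low, warned_low = fit_saga(max_iter=1)
model_high, warned_high = fit_saga(max_iter=10000)

# n_iter_ holds the number of iterations the solver actually used
print("max_iter=1:", warned_low, int(model_low.n_iter_[0]))
print("max_iter=10000:", warned_high, int(model_high.n_iter_[0]))
```

If n_iter_ equals max_iter, the solver ran out of iterations before meeting tol, which is the situation the warning reports.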