
LogisticRegression() vs LogisticRegressionCV() and its Cs hyper-parameter


I've built a model using LogisticRegression(), and after a grid search the results suggest that an inverse of regularization strength of C = .0000001 is the "best" value for my predictions.

This parameter works fine for LogisticRegression(), but since I want to cross-validate, I decided to use LogisticRegressionCV(). The equivalent parameter there is called Cs, yet when I try to pass the same value, Cs = .0000001, I get an error:

    797     warm_start_sag = {"coef": np.expand_dims(w0, axis=1)}
    799 coefs = list()
--> 800 n_iter = np.zeros(len(Cs), dtype=np.int32)
    801 for i, C in enumerate(Cs):
    802     if solver == "lbfgs":

TypeError: object of type 'float' has no len()

Referring to the documentation, it seems that for LogisticRegressionCV():

If Cs is an int, then a grid of Cs values is chosen on a logarithmic scale between 1e-4 and 1e4.

How would I then still input a value of Cs = .0000001? I'm confused about how to proceed.


Solution

  • LogisticRegressionCV is not meant to be just cross-validation-scored logistic regression; it is a hyperparameter-tuned (by cross-validation) logistic regression. That is, it tries several different regularization strengths, and selects the best one using cross-validation scores (then refits a single model on the entire training set, using that best C). Cs can be a list of values to try for C, or an integer to let sklearn create a list for you (as in your quoted doc).

    If you just want to score your model with fixed C, use cross_val_score or cross_validate.

    (You probably can use LogisticRegressionCV, setting Cs=[0.0000001], but it's not the right semantic usage.)
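Both options can be sketched as follows; this is a minimal example in which the synthetic dataset, fold count, and candidate C values are placeholders, not part of the original question:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the asker's dataset.
X, y = make_classification(n_samples=200, random_state=0)

# Option 1: cross-validation-score a model with a fixed C.
scores = cross_val_score(LogisticRegression(C=0.0000001), X, y, cv=5)
print(scores.mean())

# Option 2: let LogisticRegressionCV tune C itself. Cs must be a
# list of candidate values (or an int), never a bare float --
# passing a float is exactly what triggers the TypeError above.
clf = LogisticRegressionCV(Cs=[0.0000001, 0.001, 1.0], cv=5).fit(X, y)
print(clf.C_)  # the C selected by cross-validation, refit on all data
```

Note that clf.C_ reports the winning value per class, so you can inspect which candidate the cross-validation actually picked.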