
Scikit_learn's PolynomialFeatures with logistic regression resulting in lower scores


I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation, I was getting around 62-65% accuracy on each split (cv=5).

I thought that if I made the data quadratic, the accuracy would increase. However, I'm getting the opposite effect: each cross-validation split now scores in the 40s, percentage-wise. So I presume I'm doing something wrong when making the data quadratic?

Here is the code I'm using:

from sklearn import preprocessing
X_scaled = preprocessing.scale(X)

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)  # note: degree=3 adds cubic terms, not just quadratic
poly_x = poly.fit_transform(X_scaled)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', max_iter=200)

# sklearn.cross_validation was removed; use sklearn.model_selection instead
from sklearn.model_selection import cross_val_score
cross_val_score(classifier, poly_x, y, cv=5)

array([ 0.46418338,  0.4269341 ,  0.49425287,  0.58908046,  0.60518732])

which makes me suspect I'm doing something wrong.

I also tried the reverse order, transforming the raw data with PolynomialFeatures first and then scaling it with preprocessing.scale, but that raised a warning:

UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features. warnings.warn("Numerical issues were encountered "

So I didn't bother going this route.

The other thing that's bothering me is the speed of the quadratic computations. cross_val_score takes around a couple of hours to output the scores when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, running Windows 7.
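For reference, part of the slowdown is just the size of the expanded feature matrix: PolynomialFeatures with degree d on n input features produces C(n + d, d) output columns (including the bias column), which can be checked directly:

```python
from math import comb

n, d = 61, 3  # 61 input features, degree-3 expansion
n_output = comb(n + d, d)
print(n_output)  # -> 41664 columns for the 1741-row dataset
```

So the degree-3 expansion turns 61 features into over 41,000, and the logistic regression solver has to fit all of them on every fold.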

Thank you.


Solution

  • Have you tried using MinMaxScaler instead of preprocessing.scale? Standardization with preprocessing.scale outputs values both above and below zero, so a feature scaled to -0.1 and one scaled to 0.1 end up with the same squared value despite not being similar at all. Intuitively, that would seem likely to lower the score of a polynomial fit. That said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits in general. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
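A minimal sketch of the MinMaxScaler idea, chained in a Pipeline so the scaling is re-fit inside each cross-validation fold (the synthetic dataset and the degree-2 choice here are illustrative, not your data):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for your (1741, 61) dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(
    MinMaxScaler(),                # maps each feature into [0, 1], so no sign flips
    PolynomialFeatures(degree=2),  # quadratic, which also keeps the feature count down
    LogisticRegression(penalty='l2', max_iter=200),
)

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Using a Pipeline also avoids the data leakage of scaling the full dataset before splitting, which your original code does.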