Sklearn GridSearchCV using Pandas DataFrame Column


I'm running GridSearchCV (grid search with cross-validation) from scikit-learn on an SGDClassifier (stochastic gradient descent classifier). I'm using a pandas DataFrame for the features and a pandas Series for the target. Here's the code:

import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier

parameters = {'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
              'alpha': [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001],
              'n_iter': list(np.arange(1, 1001))}
clf = GridSearchCV(estimator=SGDClassifier(), param_grid=parameters, scoring='f1')
print(clf)
clf.fit(X_train, y_train)

Here X_train is a pandas DataFrame with 300 rows and 31 columns, with the following column names:

['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

And y_train is a pandas Series with 300 rows and 1 column, with the following name:

['passed']

When I run the grid search, I get the following error:

IndexError: too many indices for array

Solution

  • The code below prepares a random dataset which conforms to your definition:

    • X_train=300x31 DataFrame
    • y_train=300x1 Series (with 2 classes, 0 and 1).

    With the X_train and y_train below, your code works, so the problem may be in the data itself.

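    import pandas as pd
    import numpy as np

    N = 300  # number of samples
    D = 31   # number of features

    # Alternating 0/1 labels; integer division keeps the repeat count an int on Python 3
    y_train = pd.Series([0, 1] * (N // 2))
    # Every feature is the label plus Gaussian noise (the label column broadcasts across D columns)
    X_train = y_train.values.reshape(-1, 1) + np.random.normal(size=(N, D))
    X_train = pd.DataFrame(X_train)
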

    Note that you mention the DataFrame has 31 columns, but the list of column names you provided has only 30 elements, so the problem may be in the construction of X_train.

    (I ran the test with fewer parameters; here is the reduced version for reproducibility.)

    # Note: from scikit-learn 0.18 on, GridSearchCV lives in sklearn.model_selection,
    # and later releases replace SGDClassifier's n_iter with max_iter.
    from sklearn.grid_search import GridSearchCV
    from sklearn.linear_model import SGDClassifier

    parameters = {'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
                  'alpha': [0.1, 0.01],
                  'n_iter': [1, 2, 1000]}
    clf = GridSearchCV(estimator=SGDClassifier(), param_grid=parameters, scoring='f1')
    print(clf)
    clf.fit(X_train, y_train)
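
Since the problem may be in the data itself, a quick look at the shapes of X_train and y_train can help narrow it down. The sketch below is only an illustration: the helper name describe_training_data and the stand-in data are hypothetical, not taken from your code. It reports the shapes, whether the row counts agree, and whether the number of columns matches a list of expected names. One thing it can reveal is a target stored as a 300x1 DataFrame rather than a 1-D Series, which may be one source of indexing errors; whether that applies to your data is an assumption you would need to check.

```python
import numpy as np
import pandas as pd


def describe_training_data(X, y, expected_columns=None):
    """Collect basic shape facts about the features X and the target y."""
    report = {
        "X_shape": X.shape,
        "y_shape": y.shape,
        "y_ndim": y.ndim,  # 1 for a Series, 2 for a DataFrame
        "same_row_count": X.shape[0] == y.shape[0],
    }
    if expected_columns is not None:
        # Compare the actual column count with the list of names you expect
        report["column_count_matches"] = X.shape[1] == len(expected_columns)
    return report


# Hypothetical stand-in data: 300 rows, 31 features,
# with the target stored as a one-column DataFrame
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 31)))
y_frame = pd.DataFrame({"passed": [0, 1] * 150})
print(describe_training_data(X_demo, y_frame))

# Selecting the single column gives a 1-D Series instead
y_series = y_frame["passed"]
print(describe_training_data(X_demo, y_series))
```

If y_ndim comes back as 2 for your own y_train, selecting the single column as shown for y_series gives a 1-D target that you can pass to clf.fit instead.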