Tags: python, machine-learning, scikit-learn, classification, cross-validation

Logistic regression and cross-validation


I am trying to solve a classification problem on a given dataset through logistic regression (this is not the problem). To avoid overfitting, I'm trying to implement it through cross-validation (and here's the problem): there's something I'm missing to complete the program. My goal here is to determine accuracy.

But let me be specific. This is what I've done:

  1. I split the set into train set and test set
  2. I defined the logistic regression model to be used
  3. I used the cross_val_predict method (in sklearn.cross_validation) to make predictions
  4. Lastly, I measured accuracy

Here is the code:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
 
# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# define method
logreg=LogisticRegression()

# cross-validation prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted)) 

My problems:

  • From what I understand, the test set should not be considered until the very end, and cross-validation should be performed on the training set. That's why I passed X_train and t_train to the cross_val_predict method. However, I get an error saying:

    ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]

    where 6016 is the number of samples in the whole dataset and 4812 is the number of samples in the training set after the split (a minimal reproduction of this error is sketched right after this list)

  • After this, I don't know what to do. I mean: when do X_test and t_test come into play? I don't understand how I should use them after cross-validating, or how to get the final accuracy.
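
For reference, scikit-learn raises this exact error whenever the two arrays it receives have different lengths, so here is a minimal (purely hypothetical) reproduction with the same sizes as in my error message:

import numpy as np
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Hypothetical arrays whose lengths match the ones reported in my error.
X_demo = np.random.rand(6016, 5)
y_demo = np.random.randint(0, 2, size=4812)

# Raises: ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]
cross_validation.cross_val_predict(LogisticRegression(), X_demo, y_demo, cv=10)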

Bonus question: I'd also like to perform scaling and dimensionality reduction (through feature selection or PCA) within each fold of the cross-validation. How can I do this? I've seen that defining a pipeline can help with scaling, but I don't know how to apply it to the dimensionality reduction as well.


Solution

  • Here is working code tested on a sample dataframe. The first issue in your code is that the target is a pandas Series, not an np.array. You also shouldn't keep the target column among your features. Below I illustrate how to manually split the data into training and testing sets with train_test_split, and how to use the wrapper cross_val_score to automatically split, fit, and score.

    import random
    import string

    import numpy as np
    import pandas as pd
    from sklearn import linear_model, model_selection

    random.seed(42)
    # Create example df with alphabetic col names.
    alphabet_cols = list(string.ascii_uppercase)[:26]
    df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                      columns=alphabet_cols)
    df['Target'] = df['A']
    df.drop(['A'], axis=1, inplace=True)
    print(df.head())
    y = df.Target.values  # df['Target'] is a pandas Series, not an np.array.
    feature_cols = [i for i in list(df.columns) if i != 'Target']
    X = df.loc[:, feature_cols].values  # .values instead of the deprecated .as_matrix()
    # Illustrated here: manual splitting of training and testing data.
    X_train, X_test, y_train, y_test = \
        model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
    
    # Initialize model.
    # (LinearRegression is used for this toy numeric target; for the
    #  classification task in the question, use LogisticRegression instead.)
    logreg = linear_model.LinearRegression()
    
    # Use cross_val_score to automatically split, fit, and score.
    scores = model_selection.cross_val_score(logreg, X, y, cv=10)
    print(scores)
    print('average score: {}'.format(scores.mean()))
    

    Output

         B    C    D    E    F    G    H    I    J    K   ...    Target
    0   20   33  451    0  420  657  954  156  200  935   ...    253
    1  427  533  801  183  894  822  303  623  455  668   ...    421
    2  148  681  339  450  376  482  834   90   82  684   ...    903
    3  289  612  472  105  515  845  752  389  532  306   ...    639
    4  556  103  132  823  149  974  161  632  153  782   ...    347
    
    [5 rows x 26 columns]
    [-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399  0.0328
     -0.0409]
    average score: -0.04258093018969249
    
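    As for your second question: the held-out test set only comes into play at the very end. A sketch of the pattern, reusing the X_train/X_test/y_train/y_test names from the split above but assuming your own classification data and LogisticRegression as in the question:

    from sklearn import metrics, model_selection
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression()

    # Cross-validate on the training portion only (model assessment / tuning).
    cv_scores = model_selection.cross_val_score(clf, X_train, y_train, cv=10)
    print('mean CV accuracy: {}'.format(cv_scores.mean()))

    # Final step: fit once on the full training set,
    # then score once on the held-out test set.
    clf.fit(X_train, y_train)
    print('test accuracy: {}'.format(metrics.accuracy_score(y_test, clf.predict(X_test))))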

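    Regarding the bonus question: put the scaler, the dimensionality reduction step, and the classifier into a single Pipeline and cross-validate the pipeline itself; that way scaling and PCA are re-fit on the training part of every fold, so nothing leaks from the validation part. A rough sketch (the number of components is an arbitrary example value):

    from sklearn import model_selection
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('pca', PCA(n_components=10)),   # arbitrary example value
        ('logreg', LogisticRegression()),
    ])

    # The whole pipeline is refit inside each fold, so the scaler and PCA
    # never see that fold's validation samples during fitting.
    scores = model_selection.cross_val_score(pipe, X_train, y_train, cv=10)
    print('mean CV accuracy: {}'.format(scores.mean()))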