python, sklearn-pandas, non-linear-regression, k-fold

How can I use K-fold cross validation for negative binomial regression in sklearn?


I want to apply a negative binomial regression model to the dataset and examine the model scores and the features' weights and significance using K-fold cross-validation. Here is the dataframe after applying the MinMax scaler; w4 is a categorical variable.

data.head()


     w1      w2      w3      w4     Y
0   0.17    0.44    0.00    2004    1   
1   0.17    0.83    0.22    2004    0   
2   0.00    1.00    0.34    2005    0
3   1.00    0.00    1.00    2005    1
4   1.00    0.22    0.12    2006    3
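
The continuous predictors were scaled roughly as below (a minimal sketch, assuming the unscaled columns carry the same names; w4 and Y were left untouched):

from sklearn.preprocessing import MinMaxScaler

# Scale only the continuous predictors; the year column w4 and the count target Y stay as-is.
data[['w1', 'w2', 'w3']] = MinMaxScaler().fit_transform(data[['w1', 'w2', 'w3']])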

I used the following code to get the score of the trained model on the test dataset, but there seems to be a problem with how the train and test splits are passed to the model. I would appreciate any help.

scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train, test in kfold.split(data):
    model = smf.glm(formula = "Y ~ w1 + w2 + w3 + C(w4)", data=X.iloc[train,:], family=sm.families.NegativeBinomial()).fit()
    scores = scores.append(model.get_prediction(X.iloc[test,:])
    
print(scores)

Solution

  • Have you defined X and Y? You are passing the data DataFrame to kfold.split, but inside the loop you reference X, which is never created. Note also that the formula interface of smf.glm expects a DataFrame containing the named columns (including Y), so the simplest fix is to slice data itself with the fold indices rather than building separate X and Y arrays.

    Also, you overwrite your original scores list: list.append returns None, so scores = scores.append(...) leaves scores set to None after the first fold. Just call scores.append(...) without reassigning. Keep in mind as well that a fitted statsmodels GLM result has no sklearn-style .score() method, so you need to compute a test metric yourself. For instance:

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from sklearn.model_selection import KFold
    from sklearn.metrics import mean_squared_error

    preds, scores = [], []
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, test_idx in kfold.split(data):
        # Keep the fold rows as DataFrames: the formula needs the named
        # columns w1..w4 and Y, which raw NumPy arrays would not carry.
        train_df, test_df = data.iloc[train_idx], data.iloc[test_idx]
        model = smf.glm(formula="Y ~ w1 + w2 + w3 + C(w4)",
                        data=train_df,
                        family=sm.families.NegativeBinomial()).fit()
        y_pred = model.predict(test_df)  # out-of-fold predictions
        preds.append(y_pred)
        # statsmodels results have no sklearn-style .score(); use an explicit metric
        scores.append(mean_squared_error(test_df['Y'], y_pred))
    print(scores)
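
    Since you also want to examine the features' weights and significance, a minimal follow-up sketch (continuing with the same data, formula, and kfold as above) is to collect each fold's params and pvalues, which the fitted GLM results expose as pandas Series:

    import pandas as pd

    coefs, pvals = [], []
    for train_idx, _ in kfold.split(data):
        res = smf.glm("Y ~ w1 + w2 + w3 + C(w4)",
                      data=data.iloc[train_idx],
                      family=sm.families.NegativeBinomial()).fit()
        coefs.append(res.params)    # per-fold coefficient estimates
        pvals.append(res.pvalues)   # per-fold Wald p-values

    # One row per fold, one column per model term
    print(pd.DataFrame(coefs).describe())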