I want to fit a negative binomial regression model to this dataset and examine the model's scores and the features' weights and significance using K-Fold cross-validation. Here is the dataframe after applying the MinMax scaler; w4 is a categorical variable.
data.head()
     w1    w2    w3    w4  Y
0  0.17  0.44  0.00  2004  1
1  0.17  0.83  0.22  2004  0
2  0.00  1.00  0.34  2005  0
3  1.00  0.00  1.00  2005  1
4  1.00  0.22  0.12  2006  3
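For context, the scaling was done roughly like this, assuming sklearn's MinMaxScaler applied only to the continuous columns (the categorical w4 and the count outcome Y are left unscaled):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# rescale only the continuous predictors to [0, 1]
data[['w1', 'w2', 'w3']] = scaler.fit_transform(data[['w1', 'w2', 'w3']])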
I used the following code to get the score of the trained model on the test dataset, but there seems to be a problem with how the train and test datasets are passed to the model. I'd appreciate it if anyone could help.
scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train, test in kfold.split(data):
    model = smf.glm(formula="Y ~ w1 + w2 + w3 + C(w4)", data=X.iloc[train, :],
                    family=sm.families.NegativeBinomial()).fit()
    scores = scores.append(model.get_prediction(X.iloc[test, :]))
print(scores)
Have you defined X and Y? You pass the data DataFrame to the kfold.split method, yet inside the loop you index X, which never appears to be defined. Note also that the formula interface resolves Y ~ w1 + w2 + w3 + C(w4) by column name, so whatever you pass as data= must contain the Y column too; the simplest fix is to slice rows of data itself rather than building a feature-only X.
Also, you overwrite your original scores list: list.append mutates the list in place and returns None, so scores = scores.append(...) replaces scores with None on the first iteration.
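You can verify that quickly:

nums = []
print(nums.append(1))  # None -- append mutates in place, returns nothing
print(nums)            # [1]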
For instance:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import KFold

preds, scores = [], []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in kfold.split(data):
    # slice the full frame: the formula pulls w1..w3, w4 and Y by name
    train, test = data.iloc[train_idx], data.iloc[test_idx]
    model = smf.glm(formula="Y ~ w1 + w2 + w3 + C(w4)",
                    data=train,
                    family=sm.families.NegativeBinomial()).fit()
    # note: predict raises if a test fold contains a C(w4) level
    # that never appeared in the training fold
    y_pred = model.predict(test)  # predicted mean counts on the held-out fold
    preds.append(y_pred)
    # statsmodels results have no sklearn-style score(); compute a
    # held-out metric explicitly, e.g. mean squared error
    scores.append(np.mean((test['Y'].to_numpy() - y_pred.to_numpy()) ** 2))
print(scores)
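Since you also want the features' weights and significance, each fitted fold's results object exposes these directly:

print(model.params)     # coefficients from the last fitted fold
print(model.pvalues)    # their p-values
print(model.summary())  # full table with standard errors and CIs

If you want them per fold, append model.params inside the loop the same way as preds.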