python-2.7 machine-learning scikit-learn sklearn-pandas

Divide the testing set into subgroups, then make predictions on each subgroup separately

I have a dataset similar to the following table:

The prediction target is the 'score' column. I'm wondering how I can divide the testing set into different subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.

Now what I have is as follows:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0,1,2,3,4):
    y_new=y_test[(y_test>=i) & (y_test<=i+1)]
    y_new_pred=model.predict(X_test)
    print metrics.r2_score(y_new, y_new_pred)

However, my code does not work, and this is the error that I get:

Found input variables with inconsistent numbers of samples: [14279, 55955]

I have tried the solution provided below. For the full score range (0-5) the r^2 is 0.67, but for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are all significantly lower than that. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some lower?

Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.

Solution

When you compute the metrics, you have to filter the predicted values as well, using the same subset condition that you apply to the true values.

Basically you are trying to compute

metrics.r2_score([1,3],[1,2,3,4,5])

which raises an error:

ValueError: Found input variables with inconsistent numbers of samples: [2, 5]
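To make the mismatch concrete, here is a small sketch on made-up arrays (the values of `y_true` and `y_pred` are invented stand-ins for `y_test` and `model.predict(X_test)`, not your data). It shows the failing call next to the aligned one, where the same boolean mask is applied to both arrays:

```python
import numpy as np
from sklearn import metrics

# Invented stand-ins for y_test and model.predict(X_test).
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Boolean mask with one entry per test row.
subset = (y_true >= 1) & (y_true <= 3)

try:
    # 3 filtered true values against 5 unfiltered predictions.
    metrics.r2_score(y_true[subset], y_pred)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# Using the same mask on both arrays keeps them aligned row by row.
aligned_r2 = metrics.r2_score(y_true[subset], y_pred[subset])

print(mismatch_error)
print(aligned_r2)  # about 0.97 for these invented numbers
```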

Hence, my suggested solution would be

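model.fit(X_train, y_train)
# Compute the predictions only once, for the whole test set.
y_pred = model.predict(X_test)

for i in (0, 1, 2, 3, 4):
    # Boolean mask selecting the test rows whose true score lies in [i, i+1].
    subset = (y_test >= i) & (y_test <= i + 1)
    # Apply the same mask to the true values and to the predictions.
    print(metrics.r2_score(y_test[subset], y_pred[subset]))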
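
On the follow-up about the per-range r^2 values all being lower than 0.67: this is expected and does not by itself indicate a bug. r^2 is 1 - SS_res / SS_tot, and SS_tot is measured around the mean of whatever true values you pass in. Inside a band that is only one point wide the true scores barely vary, so SS_tot is small while the prediction errors stay about as large as before. The ratio grows, r^2 drops, and it can even become negative. The per-band values are therefore not required to scatter around the full-range value. A rough sketch on synthetic data (not your dataset; the `r2` helper, the sample size, and the noise level of 0.5 are assumptions chosen for illustration):

```python
import numpy as np

def r2(y_true, y_pred):
    # Standard definition: 1 - SS_res / SS_tot. SS_tot is taken around the
    # mean of the y_true values passed in, so it shrinks for a narrow band.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins: true scores spread over 0-5, predictions with the
# same error size everywhere (assumption for the sake of the example).
rng = np.random.RandomState(0)
y_true = rng.uniform(0, 5, size=5000)
y_hat = y_true + rng.normal(0, 0.5, size=5000)

overall_r2 = r2(y_true, y_hat)

bin_r2 = []
for i in (0, 1, 2, 3, 4):
    subset = (y_true >= i) & (y_true <= i + 1)
    bin_r2.append(r2(y_true[subset], y_hat[subset]))

print(overall_r2)  # high, because the scores vary a lot over the full range
print(bin_r2)      # every band is far lower, although the errors are the same
```

If you want a per-range number that can be compared across bands, an error metric such as metrics.mean_absolute_error computed on each subset may be easier to interpret than r^2.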