python-2.7, machine-learning, scikit-learn, sklearn-pandas

Divide the testing set into subgroups, then make predictions on each subgroup separately


I have a dataset similar to the following table: (image not available)

The prediction target is the 'score' column. I'm wondering how I can divide the testing set into subgroups by score (for example, scores between 1 and 3) and then check the accuracy on each subgroup.

Now what I have is as follows:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0,1,2,3,4):
    y_new=y_test[(y_test>=i) & (y_test<=i+1)]
    y_new_pred=model.predict(X_test)
    print metrics.r2_score(y_new, y_new_pred)

However, my code does not work, and this is the error I get:

Found input variables with inconsistent numbers of samples: [14279, 55955]

I have tried the solution provided below. For the full score range (0-5) the r^2 is 0.67, but for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are all significantly lower than that. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some lower?

Could anyone kindly let me know what I did wrong? Thanks a lot for all your help.


Solution

  • When computing the metrics, you also have to filter the predicted values, using the same subset condition that you apply to y_test.

    Basically you are trying to compute

    metrics.r2_score([1,3],[1,2,3,4,5])
    

    which raises an error:

    ValueError: Found input variables with inconsistent numbers of samples: [2, 5]

    Hence, my suggested solution would be

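    model.fit(X_train, y_train)
    # Compute the predictions only once, for the whole test set.
    y_pred = model.predict(X_test)

    for i in (0, 1, 2, 3, 4):
        # Boolean mask for this subgroup. Apply the same mask to both arrays
        # so the true and predicted values stay aligned and equal in length.
        subset = (y_test >= i) & (y_test <= i + 1)
        print(metrics.r2_score(y_test[subset], y_pred[subset]))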
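On the follow-up about the per-range r^2 being lower than the full-range value: that is expected rather than a sign of a bug. r^2 is 1 - SS_res / SS_tot, where SS_tot is the spread of the true values around their own mean within the subset being scored. A narrow score range has a much smaller SS_tot, so errors of the same size cost far more r^2, and the value can even go negative. Below is a minimal sketch of that effect in plain Python. The data and the helper names `r2` and `r2_by_range` are made up for illustration; `r2` is a hand-written version of the standard definition, not scikit-learn's code, and it assumes the true values in each range are not all identical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_by_range(y_true, y_pred, lows=(0, 1, 2, 3, 4)):
    """Return {low: r2} for the pairs whose true value lies in [low, low + 1].

    Both bounds are inclusive, as in the loop above, so a value sitting
    exactly on a boundary is counted in two ranges.
    """
    scores = {}
    for low in lows:
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if low <= t <= low + 1]
        scores[low] = r2([t for t, _ in pairs], [p for _, p in pairs])
    return scores


# Hypothetical data: true scores 0.0, 0.1, ..., 5.0, and every prediction
# is off by exactly 0.4 (alternately too high and too low).
y_true = [i / 10.0 for i in range(51)]
y_pred = [t + (0.4 if i % 2 == 0 else -0.4) for i, t in enumerate(y_true)]

full_r2 = r2(y_true, y_pred)
range_r2 = r2_by_range(y_true, y_pred)

print("full range: %.3f" % full_r2)
for low in sorted(range_r2):
    print("%d-%d: %.3f" % (low, low + 1, range_r2[low]))
```

With this toy data the full-range r^2 is about 0.926, while every one-point range scores about -0.6, even though the absolute error is 0.4 everywhere. If the goal is to compare how well the model does in different score ranges, an error measure that does not depend on the spread of the subset, such as metrics.mean_absolute_error, may be easier to interpret.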