Tags: statistics, classification, regression, ensemble-learning

Sensitivity vs Positive Predictive Value - which is best?


I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the two concepts below, as I am mainly interested in predicting more 1's.

1. Should I give preference to Sensitivity or to Positive Predictive Value?
Some ensemble techniques give at most 45% sensitivity with a low Positive Predictive Value,
while others give 62% Positive Predictive Value with low sensitivity.


2. My dataset has around 450K observations and 250 features.
After a power test I took 10K observations by simple random sampling. When selecting
features by variable importance with ensemble techniques, the selected features
differ from those I got when I tried with 150K observations.
Based on my intuition and domain knowledge, the features that came up as important in
the 150K-observation sample seem more relevant. What is the best practice?

3. Finally, can I use the variable importance generated by RF in other ensemble
techniques to predict the accuracy?

Can you please help me out, as I am a bit confused about which way to go?


Solution

  • The preference between Sensitivity and Positive Predictive Value depends on the ultimate goal of your analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/ In short, they look at the results from two different perspectives. Sensitivity gives you the probability that the test will find the "condition" among those who actually have it. Positive Predictive Value gives you the prevalence of the "condition" among those who test positive.
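    The two perspectives can be made concrete with a small sketch (plain Python, hypothetical labels and predictions):

    ```python
    # Hypothetical ground truth (4 actual 1's) and model predictions.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

    # Tally the confusion-matrix cells relevant to the two metrics.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    sensitivity = tp / (tp + fn)  # of the actual 1's, how many were found
    ppv = tp / (tp + fp)          # of the predicted 1's, how many were right
    print(f"sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")
    ```

    Here the same predictions score 0.50 on sensitivity (2 of 4 actual 1's found) but 0.67 on PPV (2 of 3 predicted 1's correct), which is exactly the kind of trade-off described in the question.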

    Accuracy depends on the outcome of your classification: it is defined as (true positives + true negatives) / (total), so it is computed from a model's predictions, not from the variable importances generated by RF. The RF importances can at most guide which features you feed into another ensemble; the accuracy must still be measured on that model's own predictions.
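    For clarity, the definition above in code (plain Python, hypothetical labels):

    ```python
    # Accuracy = (TP + TN) / total, computed from predictions alone.
    y_true = [1, 0, 1, 0, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 0, 0]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == p == 0)
    accuracy = (tp + tn) / len(y_true)
    print(accuracy)  # 2 TP + 4 TN out of 8 -> 0.75
    ```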

    Also, it is possible to compensate for the imbalance in the dataset; see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
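    One common way to do this in scikit-learn is the `class_weight` option of `RandomForestClassifier` (resampling is another); a minimal sketch on synthetic data mimicking the 25%/75% split from the question:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic imbalanced data: roughly 75% class 0, 25% class 1.
    X, y = make_classification(n_samples=2000, weights=[0.75, 0.25],
                               random_state=0)

    # class_weight="balanced" reweights classes inversely to their
    # frequency, so errors on the minority 1's cost more.
    clf = RandomForestClassifier(n_estimators=100,
                                 class_weight="balanced",
                                 random_state=0)
    clf.fit(X, y)
    ```

    Whether reweighting or resampling works better is problem-dependent, so it is worth comparing both against the sensitivity/PPV trade-off you care about.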