
Counting the # of correctly specified & overspecified regression models selected by a variable selection algorithm based on the sensitivity, FPR, & specificity


I have run k Backward Elimination & Forward Selection Stepwise Regressions on a corresponding set of k synthetic datasets and have written code to calculate the sensitivity (aka the True Positive Rate), the False Positive Rate, and the specificity (aka the True Negative Rate) for the variables selected by each of the k BE & FS regressions.

For the BE Stepwise case, I used the following code:

### Benchmark 2: Run a Backward Elimination Stepwise Regression
### function on each of the  csvs.
### Assign the full models to their corresponding csvs and
### store these in the object "all_regressors_models"
library(parallel)
CL <- makeCluster(detectCores() - 1L)
clusterExport(CL, c('datasets'))
set.seed(11)      # for reproducibility
system.time(BE.fits <- parLapply(CL, datasets, \(X) {
    full_model <- lm(Y ~ ., data = X)   # Y ~ . keeps Y off the right-hand side
    step(full_model, scope = formula(full_model), 
         direction = 'backward', trace = FALSE) }) )
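
For the Forward Selection case, a minimal sketch of the analogous call, assuming the same datasets list and cluster: forward selection starts from an intercept-only model and lets step() add terms up to the full model.

FS.fits <- parLapply(CL, datasets, \(X) {
    null_model <- lm(Y ~ 1, data = X)    # intercept-only starting point
    full_model <- lm(Y ~ ., data = X)    # supplies the upper scope for step()
    step(null_model, scope = formula(full_model),
         direction = 'forward', trace = FALSE) })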

BE_Coeffs <- lapply(seq_along(BE.fits), function(i) coef(BE.fits[[i]]))
stopCluster(CL)

IVs_Selected_by_BE <- lapply(seq_along(BE.fits), 
                             \(i) names(coef(BE.fits[[i]])[-1]))


### Count up how many Variables Selected match  the true 
### structural equation variables for that dataset in order
### to measure BE's performance.
# the True Positive Rate
Total_Positives <- lapply(True_Regressors, length)
BE_TPs <- lapply(seq_along(datasets), \(i)
                 sum(IVs_Selected_by_BE[[i]] %in% 
                       True_Regressors[[i]]))
BE_TPRs <- lapply(seq_along(datasets), \(j)
                  BE_TPs[[j]] / Total_Positives[[j]])


# the False Positive Rate
BE_NNs <- lapply(True_Regressors, \(i) 30 - length(i))   # total negatives per dataset
BE_FPs <- lapply(seq_along(datasets), \(i)
                 sum(!(IVs_Selected_by_BE[[i]] %in% 
                       True_Regressors[[i]])))
BE_FPRs <- lapply(seq_along(datasets), \(j)
                  BE_FPs[[j]] / BE_NNs[[j]])


# the True Negative Rate
BE_TNRs <- lapply(BE_FPRs, \(i) 1 - i)

Note: the reason it is 30 - length(i) is that each of the k datasets is 500 by 31, with 30 candidate variable/regressor/predictor columns and 1 dependent variable column.
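
To avoid hard-coding the 30, one option (a small sketch, assuming every dataset has the same 31 columns) is to derive it from the data:

n_candidates <- ncol(datasets[[1]]) - 1L                            # 31 columns = 30 candidates + Y
BE_NNs <- lapply(True_Regressors, \(i) n_candidates - length(i))    # same result as above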

What I need from here is: 1. a way to count how many times the sensitivity is 1 and the specificity is also 1, because that number equals the number of models selected by Backwards Stepwise which are 'correctly specified' in the econometric jargon, meaning there are no omitted or extraneous variables in them; and 2. a way to count how many times sensitivity = 1 & FPR > 0, because that is the number of selected models which are 'overspecified' in the jargon.

If I were using Excel, these would be rather straightforward using SUMIF functions.


Solution

  • First of all, remember that the formula for the False Positive Rate is FPR = FP/(FP + TN)

    So, using the dimensions of your data and following your notation, before you can calculate the FPR you need to know the number of True Negatives each regression produces. A True Negative here is a candidate regressor that is not in the true structural equation and was (correctly) not selected, i.e. the non-structural candidates minus the False Positives. This performs that calculation:

    BE_TNs <- lapply(seq_along(datasets), \(i) BE_NNs[[i]] - BE_FPs[[i]])
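
    For intuition, a worked example with made-up numbers:

    # Hypothetical example: 10 structural regressors; the selected model keeps
    # all 10 and also picks up 3 extraneous candidates (out of 30 in total)
    FP <- 3
    TN <- (30 - 10) - FP      # 17 True Negatives
    FP / (FP + TN)            # FPR = 0.15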
    

    From there, simply using the formula, you would calculate the FPR using:

    BE_FPRs <- lapply(seq_along(datasets), \(i)
                      BE_FPs[[i]] / (BE_FPs[[i]] + BE_TNs[[i]]))
    

    As for counting up the total number of overspecified models selected (those that include extraneous variables), something like the following will do. First, convert the lists of performance metrics into vectors like so:

    BE_TPRs <- unlist(BE_TPRs)
    BE_FPRs <- unlist(BE_FPRs)
    BE_TNRs <- unlist(BE_TNRs)
    

    From there, all you need to do is execute the following line. Each of these is now a numeric vector of length k, so the comparisons work elementwise and sum() counts the TRUEs, much like the SUMIF pattern you mention. A model is overspecified when it recovers every structural variable (TPR = 1) but also selects at least one extraneous one (FPR > 0); note that you cannot additionally require TNR == 1, because TNR = 1 - FPR, so FPR > 0 forces TNR < 1:

    Overspecified <- sum((BE_TPRs == 1) & (BE_FPRs > 0))
    

    And, as should be clear by now, the number of correctly specified models can be found via:

    Correct_Models <- sum((BE_TPRs == 1) & (BE_FPRs == 0))    # equivalently, BE_TNRs == 1
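
    As a cross-check, both counts can also be computed directly from the selected and true variable names, skipping the intermediate rate lists. A minimal sketch, assuming IVs_Selected_by_BE and True_Regressors as defined in the question:

    # Classify each selected model straight from the variable names
    classify_model <- function(selected, truth) {
        tp <- sum(selected %in% truth)                    # true positives
        fp <- sum(!(selected %in% truth))                 # false positives
        c(correct = (tp == length(truth)) && fp == 0,     # no omissions, no extras
          over    = (tp == length(truth)) && fp > 0)      # no omissions, >= 1 extra
    }
    tallies <- sapply(seq_along(datasets), \(i)
        classify_model(IVs_Selected_by_BE[[i]], True_Regressors[[i]]))
    rowSums(tallies)    # counts of correctly specified and overspecified models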