r, linear-regression, r-caret, gbm, lasso-regression

Comparison of regression models in terms of the importance of variables


I would like to compare models (multiple regression, LASSO, Ridge, GBM) in terms of variable importance. But I'm not sure the procedure is correct, because the values obtained are not on the same scale.

For multiple regression and GBM, the values range from 0 to 100 when using varImp from the caret package. This statistic is calculated differently for each method:

Linear Models: the absolute value of the t-statistic for each model parameter is used.

Boosted Trees: this method uses the same approach as a single tree, but sums the importance of each boosting iteration.
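For the linear-model case, the statistic described above can be reproduced in base R. This is only a minimal sketch: the mtcars data set and the chosen predictors are illustrative assumptions, and the final min-max step mirrors what I understand caret's varImp to do by default for train objects (scale = TRUE), so treat that detail as an assumption as well.

```r
# Sketch: importance of linear-model terms as |t-statistic| (base R only).
# mtcars and the predictors wt, hp, qsec are illustrative assumptions.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Drop the intercept and take the absolute t-statistic of each term
keep <- rownames(coef_table) != "(Intercept)"
t_importance <- abs(coef_table[keep, "t value"])

# Min-max rescale so the weakest term is 0 and the strongest is 100
scaled_importance <- (t_importance - min(t_importance)) /
  (max(t_importance) - min(t_importance)) * 100

print(sort(scaled_importance, decreasing = TRUE))
```

The raw |t| values and the rescaled values rank the variables identically; only the units change.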

For LASSO and Ridge, the values instead range from 0.00 to 0.99, calculated with this function:

varImp <- function(object, lambda = NULL, ...) {
  # Extract the coefficients at the given penalty value
  beta <- predict(object, s = lambda, type = "coef")
  if (is.list(beta)) {
    # One coefficient column per list element
    out <- do.call("cbind", lapply(beta, function(x) x[, 1]))
    out <- as.data.frame(out)
  } else {
    out <- data.frame(Overall = beta[, 1])
  }
  # Importance = absolute coefficient, excluding the intercept
  out <- abs(out[rownames(out) != "(Intercept)", , drop = FALSE])
  out
}

This function was obtained here: Caret package - glmnet variable importance

I was guided by other questions on the forum, but could not understand why the scales differ. How can I make these measurements comparable?


Solution

  • If the goal is simply to compare them side-by-side, then what matters is creating a scale that they can all inhabit together, and sorting them.

    You can accomplish this by creating a standardized scale, and coercing all of your VarImps to the new consistent scale, in this case 0 to 100.

    
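    importance_data <- c(-23, 12, 32, 18, 45, 1, 77, 18, 22)

    # Linearly rescale x so that min(x) maps to 0 and max(x) maps to 100,
    # then return the values in ascending order
    new_scale <- function(x) {
      y <- (100 - 0) / (max(x) - min(x)) * (x - max(x)) + 100
      sort(y)
    }

    new_scale(importance_data)

    # results
    # [1]   0  24  35  41  41  45  55  68 100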
    

    This gives you a uniform scale. It does not mean that a 22 on one scale is equivalent to a 22 on another, but for relative comparison any common scale will do.

    It also gives you a standardized sense of how far apart the variables are in importance within each model, so you can compare the models side-by-side based on the relative positions of the scaled importances.
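To put several models side by side, the same rescaling can be applied per model while keeping the variable names, so the rows line up. The sketch below uses base R only. The function name rescale_importance and the two importance vectors are hypothetical, made up purely for illustration. They are not results from any fitted model.

```r
# Min-max rescale a named importance vector to 0-100, keeping the names.
# rescale_importance is a hypothetical helper, not part of caret.
rescale_importance <- function(x) {
  rng <- range(x)
  # Guard against division by zero when all importances are equal
  if (diff(rng) == 0) return(setNames(rep(100, length(x)), names(x)))
  (x - rng[1]) / diff(rng) * 100
}

# Hypothetical raw importances on two different scales (illustrative only)
imp_lm    <- c(wt = 3.2,  hp = 1.1,  qsec = 0.4)   # e.g. |t-statistics|
imp_lasso <- c(wt = 0.85, hp = 0.02, qsec = 0.30)  # e.g. |coefficients|

# Align both models on the same variable order before combining
comparison <- data.frame(
  lm    = rescale_importance(imp_lm),
  lasso = rescale_importance(imp_lasso)[names(imp_lm)]
)

# Ranks are often easier to compare across models than the scaled values
comparison$rank_lm    <- rank(-comparison$lm)
comparison$rank_lasso <- rank(-comparison$lasso)

print(comparison)
```

Unlike the sorted output of new_scale, this keeps each value attached to its variable, which makes it easier to see where two models disagree about the ordering.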