So, I'm using the superconductivity dataset found here... It contains 82 variables, and I am subsetting the data to 2000 rows. But when I use xgboost with mlr3, it does not calculate the importance for all the variables!?
Here's how I'm setting everything up:
library(mlr3)
library(mlr3learners)

# Read in the data and subset to the first 2000 rows
mydata <- read.csv("/Users/.../train.csv", sep = ",")
data <- mydata[1:2000, ]

# Set up an xgboost regression task and learner with mlr3
myTaskXG <- TaskRegr$new(id = "data", backend = data, target = "critical_temp")
myLrnXG <- lrn("regr.xgboost")
myModXG <- myLrnXG$train(myTaskXG)

# Take a look at the importance
myLrnXG$importance()
This outputs something like this:
wtd_mean_FusionHeat std_ThermalConductivity entropy_Density
0.685125173 0.105919410 0.078925149
wtd_gmean_FusionHeat wtd_range_atomic_radius entropy_FusionHeat
0.038797205 0.038461823 0.020889094
wtd_mean_Density wtd_std_FusionHeat gmean_ThermalConductivity
0.017211730 0.006662321 0.005598844
wtd_entropy_ElectronAffinity wtd_entropy_Density
0.001292733 0.001116518
As you can see, there are only 11 variables there, when there should be 81. If I do a similar process using ranger, everything works perfectly.
Any suggestions as to what is happening?
Short answer: {xgboost} does not return importance values for all variables, only for the ones the model actually uses.
Longer answer:
This is not an mlr3 question but one about the xgboost package. The importance() method of this learner simply calls xgboost::xgb.importance(). If you look at the example on that help page:
library(xgboost)

data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic")
xgb.importance(model = bst)
This returns
> xgb.importance(model = bst)
Feature Gain Cover Frequency
1: odor=none 0.67615471 0.4978746 0.4
2: stalk-root=club 0.17135375 0.1920543 0.2
3: stalk-root=rooted 0.12317236 0.1638750 0.2
4: spore-print-color=green 0.02931918 0.1461960 0.2
But there are 127 variables in the total dataset.
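You can check the mismatch directly. A self-contained sketch (it retrains the same small agaricus model so it runs on its own):

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic")

ncol(agaricus.train$data)          # total number of encoded features in the data
nrow(xgb.importance(model = bst))  # far fewer: only features used in a split
```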
The reason is simply that ranger and xgboost report importance differently: xgboost's gain-based importance only lists features that appear in at least one split of the fitted trees, whereas ranger computes an impurity- or permutation-based importance for every feature in the training data, so unused features show up with a value near zero rather than being dropped entirely.
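If you want a full-length importance vector from the xgboost learner anyway, you can pad the missing features with zeros yourself. A minimal sketch, assuming the built-in mtcars task as an easily accessible stand-in for your data:

```r
library(mlr3)
library(mlr3learners)

# Train a small xgboost model on a built-in regression task
task <- tsk("mtcars")
lrn_xgb <- lrn("regr.xgboost", nrounds = 5)
lrn_xgb$train(task)

# importance() only contains features used in at least one split
imp <- lrn_xgb$importance()

# Pad with zeros so every feature of the task appears
full_imp <- setNames(numeric(length(task$feature_names)), task$feature_names)
full_imp[names(imp)] <- imp
sort(full_imp, decreasing = TRUE)
```

Features that xgboost never split on simply keep the value 0 in `full_imp`.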
By the way, next time please provide a reprex (a short reproducible example using easily accessible data and code).