
Error in PipeOp high cardinality factor encoding using mlr3: Invalid 'col_roles' Names


I am trying to train a glmnet model with the mlr3 package on data that includes a categorical variable. Because this variable has 27 levels, I treated it as a high-cardinality feature and applied impact encoding. However, I am getting the following error message:

Error in .__Task__col_roles(self = self, private = private, super = super,  : 
  Assertion on 'names of col_roles' failed: Names must be a permutation of set {'feature','target','name','order','stratum','group','weight','coordinate','space','time'}, but has extra elements {'always_included'}.
This happened PipeOp high_cardinality_encoding's $train()

Here is my code:

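# Packages assumed to be attached (not shown in the original post): the unprefixed
# %>>% operator and the selector_*() helpers come from mlr3pipelines, the
# "classif.glmnet" learner is registered by mlr3learners, and the "spcv_coords"
# resampling by mlr3spatiotempcv.
library(mlr3pipelines)
library(mlr3learners)
library(mlr3spatiotempcv)

data <- read.csv("C:/Users/test.csv")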
data$presence <- as.factor(data$presence)
data$habitat <- as.factor(data$habitat)
classif_task_sp <- mlr3spatial::as_task_classif_st(id = "A1", x = data[, which(!(names(data) %in% c("ID", "year")))], target = "presence", positive = "1", 
                                                   coordinate_names = c("x", "y"), crs = "EPSG:4326", coords_as_features = FALSE)
classif_task_sp$set_col_roles("presence", roles = c("target", "stratum"))
partition_classif_task_sp <- mlr3::partition(classif_task_sp, ratio = 0.67)

factor_encoding <- mlr3pipelines::po("removeconstants") %>>%
  ## mlr3pipelines::po("collapsefactors", no_collapse_above_prevalence = 0.01) %>>%
  mlr3pipelines::po("encodeimpact", affect_columns = selector_cardinality_greater_than(10), id = "high_cardinality_encoding") %>>%
  mlr3pipelines::po("encode", method = "one-hot", affect_columns = selector_cardinality_greater_than(3), id = "low_cardinality_encoding") %>>%
  mlr3pipelines::po("encode", method = "treatment", affect_columns = selector_type("factor"), id = "binary_encoding")

learner_glmnet <- mlr3tuningspaces::lts(mlr3::lrn("classif.glmnet", predict_type = "prob", standardize = FALSE))
learner_glmnet_factor_encoding <- mlr3::as_learner(factor_encoding %>>% learner_glmnet)

tuning <- mlr3tuning::auto_tuner(tuner = mlr3tuning::tnr("grid_search", resolution = 5, batch_size = 10),
                                 learner = learner_glmnet_factor_encoding,
                                 resampling = mlr3::rsmp("spcv_coords", folds = 2),
                                 measure = mlr3::msr("classif.prauc"),
                                 terminator = mlr3tuning::trm("evals", n_evals = 2, k = 0))

run_resampling <- mlr3::resample(classif_task_sp, learner = tuning, resampling = mlr3::rsmp("spcv_coords", folds = 2), store_models = TRUE)

run_training <- tuning$train(classif_task_sp, row_ids = partition_classif_task_sp$train)

Here is the dataset: https://www.dropbox.com/scl/fi/rfyj9oav5z5yipmkr4a9q/test.csv?rlkey=vsfsyhfgh4svnoos5z6t18u5q&st=vak2wayv&dl=0

Update: I updated the packages to mlr3 0.21.1 and mlr3fselect 1.2.1, but I'm still getting the error message.


Solution

  • Can you try updating your packages? This was a bug that was fixed in mlr3 0.21.1 and mlr3fselect 1.2.1.
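
One way to confirm that the running R session actually sees those versions is sketched below. This is only an illustration: the helper name `meets_minimum` and the `required` vector are made up for this example, and the version numbers are the ones quoted in the answer. Note that `packageVersion()` reports the version of an already loaded namespace, so a session started before the update may still show the old version until R is restarted.

```r
# Minimum versions quoted in the answer (the vector itself is illustrative).
required <- c(mlr3 = "0.21.1", mlr3fselect = "1.2.1")

# Hypothetical helper: TRUE if `pkg` is installed at `minimum` or newer,
# FALSE if it is older or not installed at all.
meets_minimum <- function(pkg, minimum) {
  # system.file() returns "" when the package cannot be found.
  if (!nzchar(system.file(package = pkg))) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(minimum)
}

# Named logical vector, one entry per required package.
status <- vapply(
  names(required),
  function(pkg) meets_minimum(pkg, required[[pkg]]),
  logical(1)
)
print(status)

# Packages that still need installing or updating. Running the install in a
# fresh R session avoids keeping an old namespace loaded.
outdated <- names(status)[!status]
# if (length(outdated) > 0) install.packages(outdated)
```

If `status` is `TRUE` for both packages in a freshly restarted session and the error persists, the output of `sessionInfo()` would be useful to include in the question.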