Error when applying h2o::h2o.glrm() to mixed data


I would like to reduce the dimensionality of a mixed data set with the h2o.glrm() function from the R package h2o. My data set includes binary variables (nominal variables with two possible levels), nominal variables (with three or more possible levels), and ordinal variables (with three or more possible levels). I'm using logistic loss for the binary variables, ordinal loss for the ordinal variables, and categorical loss for the nominal variables.

Here is a minimal, reproducible example of my problem.

# Load packages
library(tibble)
library(h2o)

# Example data for MRE
my_data <- tibble::tibble(
  var.1 = as.factor(rep(1, 10)),
  var.2 = as.factor(c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1)),
  var.3 = as.factor(rep(-1, 10)),
  var.4 = as.factor(c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1)),
  var.5 = as.factor(rep(-1, 10)),
  var.6 = as.factor(c(1, 2, 3, 1, 2, 2, 2, 2, 2, 3)),
  var.7 = as.factor(c(NA, 2, 3, 2, 2, 2, 2, 3, 1, 2)),
  var.8 = as.factor(c(2, 3, 2, 2, 2, 2, 3, 2, 2, 2)),
  var.9 = as.factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3)),
  var.10 = as.factor(c(1, 1, 1, 1, NA, 1, 1, -1, -1, 1))
)

my_data_types <- tibble::tibble(
  var_name = paste("var", 1:10, sep = "."),
  var_type = c(rep("binary", 5),
               rep("ordinal", 3),
               "nominal", "binary")
)

# Initialize h2o cluster
h2o::h2o.init()
h2o::h2o.no_progress()

# Convert data to h2o object
my_data_h2o <- h2o::as.h2o(my_data)

# Define loss function for ordinal and nominal variables
losses <- tibble::tibble(
  index = which(my_data_types$var_type %in% c("ordinal", "nominal")) - 1,
  loss = NA_character_
)

for (i in seq_along(losses$index)) {
  losses$loss[i] <-
    ifelse(my_data_types$var_type[losses$index[i] + 1] == "ordinal", "Ordinal",
           ifelse(my_data_types$var_type[losses$index[i] + 1] == "nominal", "Categorical", NA))
}

# Run GLRM
my_glrm <- h2o::h2o.glrm(
  training_frame = my_data_h2o,
  k = 2,
  loss = "Logistic",
  loss_by_col_idx = losses$index,
  loss_by_col = losses$loss,
  regularization_x = "None",
  regularization_y = "None",
  transform = "NONE",
  max_iterations = 2000,
  seed = 12345
)

When I run the above model, I receive the following error message:

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
  
ERROR MESSAGE:

Illegal argument(s) for GLRM model: GLRM_model_R_1683532209346_20.  Details: ERRR on field: _loss: Logistic is not a numeric loss function

Although I don't think that this is what the error message tells me, I also ran the model on an alternative version of the data set in which binary variables are not defined as factors.

# Alternative example data for MRE
my_data_2 <- tibble::tibble(
  var.1 = rep(1, 10),
  var.2 = c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1),
  var.3 = rep(-1, 10),
  var.4 = c(-1, 1, 1, 1, 1, 1, -1, 1, 1, 1),
  var.5 = rep(-1, 10),
  var.6 = as.factor(c(1, 2, 3, 1, 2, 2, 2, 2, 2, 3)),
  var.7 = as.factor(c(NA, 2, 3, 2, 2, 2, 2, 3, 1, 2)),
  var.8 = as.factor(c(2, 3, 2, 2, 2, 2, 3, 2, 2, 2)),
  var.9 = as.factor(c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3)),
  var.10 = c(1, 1, 1, 1, NA, 1, 1, -1, -1, 1)
)

# Convert data to h2o object
my_data_2_h2o <- h2o::as.h2o(my_data_2)

# Run GLRM
my_glrm_2 <- h2o::h2o.glrm(
  training_frame = my_data_2_h2o,
  k = 2,
  loss = "Logistic",
  loss_by_col_idx = losses$index,
  loss_by_col = losses$loss,
  regularization_x = "None",
  regularization_y = "None",
  transform = "NONE",
  max_iterations = 2000,
  seed = 12345
)

When I run the model on the alternative version of the data set, I receive the following error:

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 
  
ERROR MESSAGE:

Illegal argument(s) for GLRM model: GLRM_model_R_1683532209346_21.  Details: ERRR on field: _loss: Logistic is not a numeric loss function
ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 0
ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 1
ERRR on field: _loss_by_col: Loss function Logistic cannot be applied to numeric column 6

I would be very grateful if anyone could tell me what I'm doing wrong here.


Solution

  • The loss-related arguments are not in the format the function expects, so h2o ends up applying an inappropriate loss function to some columns and raises the error.

    Instead of passing loss = and loss_by_col_idx = , pass only loss_by_col = . This argument takes one loss function name per column of your training_frame, so it must have the same length as ncol(my_data).

    # One loss name per column, in the same order as the columns of my_data
    losses2 <- dplyr::case_when(
      my_data_types$var_type == 'binary' ~ 'Logistic',
      my_data_types$var_type == 'ordinal' ~ 'Ordinal',
      TRUE ~ 'Categorical')
    
    losses2
    
    # console:
    # [1] "Logistic"    "Logistic"    "Logistic"    "Logistic"    "Logistic"    "Ordinal"    
    # [7] "Ordinal"     "Ordinal"     "Categorical" "Logistic" 
    
    # Run GLRM
    my_glrm <- h2o::h2o.glrm(
      training_frame = my_data_h2o,
      k = 2,
      loss_by_col = losses2,
      regularization_x = "None",
      regularization_y = "None",
      transform = "NONE",
      max_iterations = 2000,
      seed = 12345
    )
    

    Now your model is up and running. It drops some features that carry no information, which is the behavior we want.
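
If you prefer to avoid the dplyr dependency and catch a length mismatch before calling h2o, the same per-column loss vector can be built in base R. The following is only a minimal sketch: build_losses() is a hypothetical helper, not part of h2o, and the small example_data frame is a two-column stand-in for my_data. The last step lists columns with a single observed level, which is probably what the dropped "no information" features are (var.1, var.3, and var.5 are constant in the example data), though that link is an assumption.

```r
# Hypothetical helper (not an h2o function): map declared variable types to
# GLRM loss names, returning one entry per column of the training data.
build_losses <- function(var_types, n_cols) {
  # Same mapping as in the answer: binary -> Logistic, ordinal -> Ordinal,
  # nominal -> Categorical
  loss_lookup <- c(binary = "Logistic",
                   ordinal = "Ordinal",
                   nominal = "Categorical")
  losses <- unname(loss_lookup[var_types])
  # Fail early on an unknown type or a length that does not match the data
  stopifnot(!anyNA(losses), length(losses) == n_cols)
  losses
}

# Variable types as declared in my_data_types in the question
var_types <- c(rep("binary", 5), rep("ordinal", 3), "nominal", "binary")
losses2 <- build_losses(var_types, n_cols = 10)

# Stand-in for two columns of my_data, used to illustrate the constant check
example_data <- data.frame(
  var.1 = factor(rep(1, 10)),
  var.2 = factor(c(NA, 1, 1, -1, -1, -1, 1, 1, 1, 1))
)

# Names of columns with fewer than two distinct non-missing values
is_constant <- vapply(
  example_data,
  function(x) length(unique(stats::na.omit(x))) < 2,
  logical(1)
)
constant_cols <- names(example_data)[is_constant]
```

With the real data you would call build_losses(my_data_types$var_type, ncol(my_data)) and pass the result as loss_by_col, exactly as in the answer's call.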