Tags: r, machine-learning, deep-learning, h2o, autoencoder

Why is H2O autoencoder so slow for one data set but not the other?


When I run an H2O autoencoder on two data sets of about the same size (see below), data set A finishes within 5 minutes, but data set B is extremely slow: it takes more than 30 minutes to complete just 1% of training. I tried restarting the R session and H2O a couple of times, but that didn't help. The models for both data sets appear to have about the same number of parameters (or coefficients).

Data set A: 4 columns × 1,000,000 rows (finishes in <5 minutes)

Data set B: 8 columns × 477,613 rows (very slow)

The model below is used for both data sets:

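# Same architecture for both data sets: an autoencoder with three hidden
# layers of 25 Tanh units each; x lists the feature columns and data.hex
# is the H2O frame already loaded into the cluster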
model.dl = h2o.deeplearning(x = x,
                            training_frame = data.hex,
                            autoencoder = TRUE,
                            activation = "Tanh",
                            hidden = c(25, 25, 25),
                            variable_importances = TRUE)

The H2O cluster is given 15 GB of memory for both data sets, and the same computer is used (OS X 10.14.6, 16 GB of memory). Below is some information about the H2O and R versions.

H2O cluster version:        3.30.0.1
H2O cluster total nodes:    1
H2O cluster total memory:   15.00 GB
H2O cluster total cores:    16
H2O cluster allowed cores:  16
H2O cluster healthy:        TRUE
R Version:                  R version 3.6.3 (2020-02-29)

Please let me know if there is any other information I can provide to help resolve this issue.


Solution

  • This problem has been resolved.

    The problem is that data set B ends up with far more columns than data set A after the one-hot encoding that happens during the model run. Please see below.

    Data set A:

    There are 4 categorical features. The number of unique values for these categorical features is 12, 14, 25, and 10, respectively.

    Data set B:

There are 7 categorical features and 1 numerical feature. The number of unique values for the categorical features is 17, 49, 52, 85, 5,032 (!), 18,445 (!!), and 392,124 (!!!), respectively. This explains why training is so slow; the sketch below makes the size difference concrete.
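
    To see the gap, here is a minimal sketch in plain R (the vector names card_a and card_b are made up; the cardinalities and the 25-unit first hidden layer come from the post above):

        # One-hot width of each data set, from the reported cardinalities
        card_a <- c(12, 14, 25, 10)
        card_b <- c(17, 49, 52, 85, 5032, 18445, 392124)

        inputs_a <- sum(card_a)        # 61 encoded input columns
        inputs_b <- sum(card_b) + 1    # 415,805 columns (7 encoded factors + 1 numeric)

        # Weights in the first hidden layer alone (25 units, biases ignored)
        inputs_a * 25                  # 1,525 weights
        inputs_b * 25                  # roughly 10.4 million weights

    So although the raw frames are similar in size, data set B's encoded input is roughly 6,800 times wider, and the single 392,124-level column dominates the cost. A quick way to catch this before training is to count the distinct values per column of the raw data, e.g. sapply(df, function(col) length(unique(col))), where df is a placeholder name for your data frame.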