When I run an H2O autoencoder on two data sets of roughly the same size (see below), data set A finishes within 5 minutes, but data set B is extremely slow: it takes more than 30 minutes to complete just 1%. I tried restarting the R session and H2O a couple of times, but that didn't help. The models for the two data sets have about the same number of parameters (coefficients).
Data set A: 4 columns × 1,000,000 rows (finishes in under 5 minutes)
Data set B: 8 columns × 477,613 rows (very slow)
The model below is used for both data sets:
model.dl <- h2o.deeplearning(
  x = x,
  training_frame = data.hex,
  autoencoder = TRUE,
  activation = "Tanh",
  hidden = c(25, 25, 25),
  variable_importances = TRUE
)
The H2O cluster has 15 GB of memory for both data sets, and the same computer is used (macOS 10.14.6, 16 GB of RAM). Below is some information about the H2O and R versions.
H2O cluster version: 3.30.0.1
H2O cluster total nodes: 1
H2O cluster total memory: 15.00 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
R Version: R version 3.6.3 (2020-02-29)
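For reference, a single-node cluster with this configuration can be started from R roughly as follows. The max_mem_size and nthreads values below are assumptions chosen to match the cluster info above (not necessarily the exact call used), and the file path is a placeholder.

library(h2o)

# Start a local single-node H2O cluster with a 15 GB heap using all cores
# (values are assumptions matching the cluster info above)
h2o.init(max_mem_size = "15g", nthreads = -1)

# Import the training data into the frame used by h2o.deeplearning()
# ("path/to/data.csv" is a placeholder)
data.hex <- h2o.importFile("path/to/data.csv")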
Please let me know if there is any other information I can provide to get this issue resolved.
This problem has been resolved.
The problem is that data set B expands to far more columns than data set A after one-hot encoding when the model is trained. Please see below.
Data set A:
There are 4 categorical features, with 12, 14, 25, and 10 unique values respectively, i.e. only about 61 columns after one-hot encoding.
Data set B:
There are 7 categorical features and 1 numerical feature. The categorical features have 17, 49, 52, 85, 5032 (!), 18445 (!!), and 392124 (!!!) unique values, respectively, so one-hot encoding produces roughly 415,800 input columns (plus the numeric feature) versus about 61 for data set A. This explains why training is so slow.
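For anyone who runs into the same thing: the cardinality of each column can be checked before training, and the one-hot expansion can be capped instead of dropping the columns. The sketch below is only an illustration, not the fix I applied; it assumes data.hex is the imported training frame, and that h2o.describe() and the categorical_encoding / max_categorical_features arguments of h2o.deeplearning() are available in your h2o version (please verify against its documentation).

# Inspect per-column types and cardinalities; categorical columns with
# thousands of levels are the ones that blow up after one-hot encoding
col_info <- h2o.describe(data.hex)
print(col_info[, c("Label", "Type", "Cardinality")])

# One possible mitigation: keep only the most frequent levels of each
# categorical column instead of expanding all of them, so the input
# layer stays a manageable width
model.dl <- h2o.deeplearning(
  x = x,
  training_frame = data.hex,
  autoencoder = TRUE,
  activation = "Tanh",
  hidden = c(25, 25, 25),
  variable_importances = TRUE,
  categorical_encoding = "EnumLimited"  # alternatively, set max_categorical_features
)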