
Time issues with training an H2O Autoencoder


I was wondering if anyone knows of any glaring problems with the way my H2O autoencoder is being trained that could cause it to take so long, or of any way to reduce the training time, either through the dataset or the model construction. Any help would be greatly appreciated! Thank you very much!

I have been training an H2O autoencoder on a dataset consisting of only one-hot-encoded categorical columns. The dataset has shape (7762, 2232), and the model took about 5 hours to train. The code for building the model is as follows:

model = H2ODeepLearningEstimator(
    autoencoder = True,
    seed = -1,
    hidden = [2000,1000,500,250,125,50],
    epochs = 30,
    activation = "Tanh"
)
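
For reference, a minimal sketch of how such a model would be trained end to end, assuming a local H2O cluster and an `H2OFrame` named `train_df` holding the one-hot-encoded data (both the frame name and the file path are placeholders, not from the original post):

```python
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()  # start or connect to a local H2O cluster
train_df = h2o.import_file("one_hot_data.csv")  # hypothetical path

model = H2ODeepLearningEstimator(
    autoencoder=True,
    seed=-1,  # -1 means H2O picks a random seed
    hidden=[2000, 1000, 500, 250, 125, 50],
    epochs=30,
    activation="Tanh",
)
# An autoencoder reconstructs its own inputs, so only x is supplied.
model.train(x=train_df.columns, training_frame=train_df)
```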

Solution

  • The problem here is the number of columns. While the number of rows controls the overall training time, the number of columns controls the training time per row, and 2232 columns is quite a lot. If you can do some data munging to reduce the number of predictors, it will definitely speed up training.

    You can also try the following:

    1. set stopping_tolerance to a higher number, e.g. 0.1 or more. This enables early stopping: training stops if the scored metric does not improve by at least 0.1 relative to the previous scoring round;
    2. set max_runtime_secs=120 if you want to stop the model building after 120 seconds;
    3. reduce score_training_samples from its default of 10000 to, say, 5000. This performs scoring on fewer samples and hence can reduce training time.

    Note that stopping the model early, as in (1) and (2), may reduce training time but can leave you with a model that is not a good fit for your data.
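
    Put together, the three suggestions above would look something like this. The parameter names are real `H2ODeepLearningEstimator` arguments; `train_df` is a placeholder for your `H2OFrame`, and the concrete values of `stopping_rounds` and `stopping_metric` are illustrative assumptions:

    ```python
    from h2o.estimators import H2ODeepLearningEstimator

    model = H2ODeepLearningEstimator(
        autoencoder=True,
        hidden=[2000, 1000, 500, 250, 125, 50],
        epochs=30,
        activation="Tanh",
        # 1. early stopping: stop if MSE improves by less than 0.1
        #    over 3 consecutive scoring events
        stopping_rounds=3,
        stopping_tolerance=0.1,
        stopping_metric="MSE",
        # 2. hard cap on total training time, in seconds
        max_runtime_secs=120,
        # 3. score on fewer rows to spend less time scoring
        score_training_samples=5000,
    )
    model.train(x=train_df.columns, training_frame=train_df)
    ```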