Tags: machine-learning, deep-learning, keras, autoencoder

Keras Denoising Autoencoder (tabular data)


I have a project where I am doing a regression with Gradient Boosted Trees using tabular data. I want to see if using a denoising autoencoder on my data can find a better representation of my original data and improve my original GBT scores. Inspiration is taken from the popular Kaggle winner here.

AFAIK I have two main choices for extracting the activations of the DAE: creating a bottleneck structure and taking only the single middle layer's activations, or concatenating every layer's activations as the representation.

Let's assume I want all layer activations from the three 512-node layers below:

from keras.layers import Input, Dense
from keras.models import Model

# 31 input features -> three 512-unit hidden layers -> 31 reconstructed features
inputs = Input(shape=(31,))
encoded = Dense(512, activation='relu')(inputs)
encoded = Dense(512, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(encoded)
decoded = Dense(31, activation='linear')(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='Adam', loss='mse')

history = autoencoder.fit(x_train_noisy, x_train_clean,
                          epochs=100,
                          batch_size=128,
                          shuffle=True,
                          validation_data=(x_test_noisy, x_test_clean),
                          callbacks=[reduce_lr])  # reduce_lr: a learning-rate callback (e.g. ReduceLROnPlateau) defined elsewhere

My questions are:

  • Taking the activations of the above will give me a new representation of x_train, right? Should I repeat this process for x_test? I need both to train my GBT model.

  • How can I do inference? Each new data point will need to be "converted" into this new representation format. How can I do that with Keras?

  • Do I actually need to provide validation_data= to .fit in this situation?


Solution

  • Taking the activations of the above will give me a new representation of x_train, right? Should I repeat this process for x_test? I need both to train my GBT model.

    Of course: you need the denoised representation for both the training and the test data, because the GBT model you train afterwards only accepts the denoised features.

  • How can I do inference? Each new data point will need to be "converted" into this new representation format. How can I do that with Keras?

    If you want to use the denoised/reconstructed features, you can directly call autoencoder.predict(X_feat) to extract them. If you want to use the middle layer, you need to build a new model, encoder_only = Model(inputs, encoded), first and use it for feature extraction; that same model is what you call on each new data point at inference time. A sketch of both options follows these answers.

  • Do I actually need to provide validation_data= to .fit in this situation?

    It is better to hold out some of the training data for validation so you can detect overfitting (either by passing validation_data= explicitly or by using the validation_split argument of .fit). However, you can always train multiple models, e.g. in a leave-one-out / cross-validation fashion, and ensemble them to make full use of all the data.
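
    As a concrete illustration of the first two answers, here is a minimal sketch built on the question's model. It assumes the tensors inputs/encoded and the trained autoencoder from above, plus the clean feature arrays; GradientBoostingRegressor, y_train and x_new are purely illustrative stand-ins for whatever GBT library, regression target and new data you actually use.

    from keras.models import Model
    from sklearn.ensemble import GradientBoostingRegressor
    import numpy as np

    # Option A: take only the middle layer as the new representation.
    # `encoded` is the output tensor of the second 512-unit layer above.
    encoder_only = Model(inputs, encoded)
    z_train_mid = encoder_only.predict(x_train_clean)

    # Option B: concatenate the activations of all three 512-unit hidden layers.
    hidden_outputs = [layer.output for layer in autoencoder.layers[1:-1]]  # skip the Input and the final 31-unit layer
    all_activations = Model(inputs, hidden_outputs)

    # The identical transformation must be applied to every dataset the GBT sees ...
    z_train = np.hstack(all_activations.predict(x_train_clean))  # shape (n_train, 3 * 512)
    z_test  = np.hstack(all_activations.predict(x_test_clean))
    # ... and to any brand-new point at inference time:
    # z_new = np.hstack(all_activations.predict(x_new))

    # Train the GBT on the new representation (scikit-learn used purely as an example).
    gbt = GradientBoostingRegressor().fit(z_train, y_train)  # y_train: your regression target
    preds = gbt.predict(z_test)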

    Additional remarks:

    • 512 hidden neurons is probably too many for this task
    • consider using Dropout
    • be careful with tabular data, especially when different columns have very different dynamic ranges (MSE does not weight the reconstruction errors of the columns fairly); standardizing each column before training helps here.
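
    The last two remarks can be sketched as follows, assuming scikit-learn's StandardScaler is acceptable in your pipeline; the layer sizes, dropout rate and noise level are illustrative choices, not values from the original answer.

    from keras.layers import Input, Dense, Dropout
    from keras.models import Model
    from sklearn.preprocessing import StandardScaler
    import numpy as np

    # Put every column on a comparable scale so that MSE weights the per-column
    # reconstruction errors fairly.
    scaler = StandardScaler().fit(x_train_clean)
    x_train_scaled = scaler.transform(x_train_clean)
    x_test_scaled  = scaler.transform(x_test_clean)

    # Corrupt the scaled inputs; additive Gaussian noise is used here for brevity
    # ("swap noise", i.e. replacing a cell with a value from another row, is another
    # common choice for tabular DAEs).
    x_train_noisy = x_train_scaled + np.random.normal(0.0, 0.1, x_train_scaled.shape)

    # A smaller network with Dropout, per the two remarks above.
    inputs = Input(shape=(31,))
    h = Dense(128, activation='relu')(inputs)
    h = Dropout(0.2)(h)
    h = Dense(128, activation='relu')(h)
    h = Dropout(0.2)(h)
    h = Dense(128, activation='relu')(h)
    outputs = Dense(31, activation='linear')(h)

    autoencoder = Model(inputs, outputs)
    autoencoder.compile(optimizer='Adam', loss='mse')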