tensorflow, keras, time-series, xgboost, autoencoder

XGBoost RMSE higher after using AE features


I have time-series data with 9 inputs, so I wanted to use an autoencoder to create new features, since I can no longer create meaningful features manually. I run X through the following autoencoder:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization
from tensorflow.keras.models import Model

i = Input(shape=(9,))
encoded = BatchNormalization()(i)
encoded = Dense(256, activation='linear')(encoded)

# reconstruction head
decoded = Dense(9, name='decoded')(encoded)
# auxiliary label head stacked on the reconstruction
x = Dense(128, activation='linear')(decoded)
x = BatchNormalization()(x)
x = Dense(9, activation='swish', name='label_output')(x)

encoder = Model(inputs=i, outputs=decoded)
autoencoder = Model(inputs=i, outputs=[decoded, x])

autoencoder.compile(optimizer=tf.keras.optimizers.Adam(0.0005),
                    loss={'decoded': 'mse', 'label_output': 'mse'})

# ... the autoencoder is trained on X here ...
encoder.save_weights('encoder.hdf5')

I load the encoder:

encoder.load_weights('encoder.hdf5')

encodedX = encoder.predict(X)

I fit the encoded X to an XGBRegressor and train it against y. The cross-validation score is much worse than if I didn't encode X at all: the encoded CV score is -13, while the normal CV score is -0.09. Could someone point me in the right direction? Thanks.
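
Roughly, the comparison looks like this (a minimal sketch; the actual XGBRegressor parameters and CV splitter are omitted here):

from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

# scoring is negative MSE, so values closer to 0 are better
cv_encoded = cross_val_score(XGBRegressor(), encodedX, y,
                             scoring='neg_mean_squared_error', cv=5).mean()
cv_raw = cross_val_score(XGBRegressor(), X, y,
                         scoring='neg_mean_squared_error', cv=5).mean()
print(cv_encoded, cv_raw)  # roughly -13 vs. -0.09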


Solution

  • Autoencoders do not add any features; they learn to copy the input in a non-trivial manner (i.e., not by learning the identity function, but by learning an "abstract" representation of your inputs that the decoder stage uses to reconstruct the input).

    First, you should assess how well your autoencoder (AE) does its job. AEs can be especially tricky to train: give them too much (or too little) capacity and you end up with (something close to) the identity function, in which case your model will not be able to generate any "good" representations of X, and that could explain the performance drop (see the reconstruction check sketched after this list).

    Second, even if your AE seems to be doing a good job, it may still be adding noise (due to reconstruction error) or may have learned patterns specific to your training set that do not generalize to other data.

    Hence your XGBRegressor would indirectly be overfitting your training set.
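
To make the first point concrete, here is a minimal sketch of such a reconstruction check, assuming a held-out split X_train / X_val (names not in the original question) and the trained autoencoder from the question:

import numpy as np

# Reconstruction error of the AE on held-out data.
# autoencoder.predict returns [decoded, label_output]; the first output
# is the reconstruction of the inputs.
recon_val, _ = autoencoder.predict(X_val)
recon_mse = np.mean((X_val - recon_val) ** 2)

# Trivial baseline: predict each feature's training-set mean.
baseline_mse = np.mean((X_val - X_train.mean(axis=0)) ** 2)

print(recon_mse, baseline_mse)

If recon_mse is not clearly below the baseline, the AE features are unlikely to help a downstream regressor. Conversely, since the code layer here is 256 linear units for only 9 inputs (an overcomplete, linear code), a near-zero reconstruction error would suggest the AE has essentially learned the identity function, the first failure mode described above.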