I'm trying to create and train an LSTM autoencoder on character sequences (strings). This is purely for dimensionality reduction, i.e. to be able to represent strings of up to T=1000 characters as fixed-length vectors of size N. For the sake of this example, let N = 10. Each character is one-hot encoded as a vector of length validChars (in my case validChars = 77; this is the dictSize in the code below).
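Concretely, I build the network input roughly like this (encodeOneHot and charToIndex are just my own helper and lookup, sketched here, not dl4j API):

INDArray encodeOneHot(String s, Map<Character, Integer> charToIndex, int validChars) {
    // shape [miniBatch=1, validChars, timeSteps=s.length()]: one one-hot column per character
    INDArray features = Nd4j.zeros(1, validChars, s.length());
    for (int t = 0; t < s.length(); t++) {
        features.putScalar(new int[]{0, charToIndex.get(s.charAt(t)), t}, 1.0);
    }
    return features;
}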
I'm using ComputationGraph so that I can later remove the decoder layers and use the remaining ones for encoding. Looking at dl4j-examples, I have come up with this:
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(12345)
.l2(0.0001)
.weightInit(WeightInit.XAVIER)
.updater(new Adam(0.005))
.graphBuilder()
.addInputs("input")
.addLayer("encoder1", new LSTM.Builder().nIn(dictSize).nOut(250)
.activation(Activation.TANH).build(), "input")
.addLayer("encoder2", new LSTM.Builder().nIn(250).nOut(10)
.activation(Activation.TANH).build(), "encoder1")
.addVertex("fixed", new PreprocessorVertex(new RnnToFeedForwardPreProcessor()), "encoder2")
.addVertex("sequenced", new PreprocessorVertex(new FeedForwardToRnnPreProcessor()), "fixed")
.addLayer("decoder1", new LSTM.Builder().nIn(10).nOut(250)
.activation(Activation.TANH).build(), "sequenced")
.addLayer("decoder2", new LSTM.Builder().nIn(250).nOut(dictSize)
.activation(Activation.TANH).build(), "decoder1")
.addLayer("output", new RnnOutputLayer.Builder()
.lossFunction(LossFunctions.LossFunction.MCXENT)
.activation(Activation.SOFTMAX).nIn(dictSize).nOut(dictSize).build(), "decoder2")
.setOutputs("output")
.backpropType(BackpropType.TruncatedBPTT).tBPTTForwardLength(tbpttLength).tBPTTBackwardLength(tbpttLength)
.build();
With this, I expected the feature dimensions to follow the path: [77,T] -> [250,T] -> [10,T] -> [10] -> [10,T] -> [250,T] -> [77,T]
I have trained this network and removed the decoder part like so:
ComputationGraph encoder = new TransferLearning.GraphBuilder(net)
.setFeatureExtractor("fixed")
.removeVertexAndConnections("sequenced")
.removeVertexAndConnections("decoder1")
.removeVertexAndConnections("decoder2")
.removeVertexAndConnections("output")
.addLayer("output", new ActivationLayer.Builder().activation(Activation.IDENTITY).build(), "fixed")
.setOutputs("output")
.setInputs("input")
.build();
But when I encode a string of length 1000 with this encoder, it outputs an NDArray of shape [1000, 10] instead of a one-dimensional vector of length 10 (the call looks roughly like the sketch below). My purpose is to represent the whole 1000-character sequence with one vector of length 10. What am I missing?
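For reference, the encoding call, using the hypothetical encodeOneHot helper from above:

INDArray features = encodeOneHot(text, charToIndex, 77); // shape [1, 77, 1000]
INDArray encoded = encoder.output(features)[0];          // shape [1000, 10], not [10]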
Nobody answered the question, but I found the answer in dl4j-examples, so I'll post it here in case it is helpful to someone.
The part between the encoder and decoder LSTMs should look like this:
.addVertex("thoughtVector",
new LastTimeStepVertex("encoderInput"), "encoder")
.addVertex("duplication",
new DuplicateToTimeSeriesVertex("decoderInput"), "thoughtVector")
.addVertex("merge",
new MergeVertex(), "decoderInput", "duplication")
The important part is that we go many-to-one with LastTimeStepVertex and then one-to-many with DuplicateToTimeSeriesVertex. This way 'thoughtVector' really is a single vector representation of the whole sequence.
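Adapted to my character-level net above, the whole configuration could look roughly like this. This is an untested sketch: it assumes two graph inputs as in the dl4j example, 'encoderInput' (the sequence to encode) and 'decoderInput' (the same sequence shifted by one step, for teacher forcing), and I leave out truncated BPTT here.

ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(12345)
    .l2(0.0001)
    .weightInit(WeightInit.XAVIER)
    .updater(new Adam(0.005))
    .graphBuilder()
    .addInputs("encoderInput", "decoderInput")
    .addLayer("encoder1", new LSTM.Builder().nIn(dictSize).nOut(250)
        .activation(Activation.TANH).build(), "encoderInput")
    .addLayer("encoder2", new LSTM.Builder().nIn(250).nOut(10)
        .activation(Activation.TANH).build(), "encoder1")
    // many-to-one: keep only the encoder state at the last time step -> shape [miniBatch, 10]
    .addVertex("thoughtVector", new LastTimeStepVertex("encoderInput"), "encoder2")
    // one-to-many: repeat the thought vector across all decoder time steps
    .addVertex("duplication", new DuplicateToTimeSeriesVertex("decoderInput"), "thoughtVector")
    // at each time step the decoder sees the previous character plus the thought vector
    .addVertex("merge", new MergeVertex(), "decoderInput", "duplication")
    .addLayer("decoder1", new LSTM.Builder().nIn(dictSize + 10).nOut(250)
        .activation(Activation.TANH).build(), "merge")
    .addLayer("output", new RnnOutputLayer.Builder()
        .lossFunction(LossFunctions.LossFunction.MCXENT)
        .activation(Activation.SOFTMAX).nIn(250).nOut(dictSize).build(), "decoder1")
    .setOutputs("output")
    .build();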
See the full example here: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/encdec/EncoderDecoderLSTM.java. Note that the example deals with word-level sequences, while my net above works with character-level sequences, but the idea is the same.
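After training, the 10-dimensional encoding of a sequence can be read directly from the 'thoughtVector' vertex, for example via feedForward. Again just a sketch: encodeOneHot and shiftRightByOne are hypothetical helpers, and the decoder input still has to be supplied because it is a graph input:

// feed both graph inputs, then read the activations of the "thoughtVector" vertex
INDArray encoderIn = encodeOneHot(text, charToIndex, dictSize);
INDArray decoderIn = encodeOneHot(shiftRightByOne(text), charToIndex, dictSize);
Map<String, INDArray> activations = net.feedForward(new INDArray[]{encoderIn, decoderIn}, false);
INDArray thought = activations.get("thoughtVector"); // shape [1, 10]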