BiLSTM (Bidirectional Long Short-Term Memory Networks) with MLP (Multi-layer Perceptron)

I am trying to implement the network architecture from the paper "Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks" by Ruiqing Yin, Herve Bredin, and Claude Barras, which is described as follows:

The model is composed of two Bi-LSTMs (Bi-LSTM 1 and 2) and a multi-layer perceptron (MLP) whose weights are shared across the sequence. Bi-LSTM 1 has 64 outputs (32 forward and 32 backward). Bi-LSTM 2 has 40 (20 each). The fully connected layers are 40-, 10- and 1-dimensional respectively. The outputs of both forward and backward LSTMs are concatenated and fed forward to the next layer. The shared MLP is made of three fully connected feedforward layers, using the tanh activation function for the first two layers and a sigmoid activation function for the last layer, in order to output a score between 0 and 1.

I have taken reference from various sources and come up with the following code:

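from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, Dense, LSTM, TimeDistributed

model = Sequential()

# Two stacked Bi-LSTMs that return the full sequence of hidden states
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(40, return_sequences=True)))
# MLP applied at every time step
model.add(TimeDistributed(Dense(40, activation='tanh')))
model.add(TimeDistributed(Dense(10, activation='tanh')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))

# Input shape is (batch, time steps, features)
model.build(input_shape=(None, 200, 35))
model.summary()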

I am confused about the TimeDistributed layer: how does it simulate an MLP, and how are the weights shared? Can you at least point out whether I am doing this right or not?

Solution

As the architecture in the paper suggests, you want to pass each of the hidden states (one per time step) through the same stack of dense layers, thus forming an MLP that is applied at every time step. Structurally, your code does this. Its summary is:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional (Bidirectional (None, 200, 128)          51200     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 80)           54080     
_________________________________________________________________
time_distributed (TimeDistri (None, 200, 40)           3240      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 200, 10)           410       
_________________________________________________________________
time_distributed_2 (TimeDist (None, 200, 1)            11        
=================================================================
Total params: 108,941
Trainable params: 108,941
Non-trainable params: 0
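
The parameter counts in this summary can be reproduced by hand, which also shows where the weight sharing comes from. The sketch below uses the standard parameter formulas for an LSTM (four gates, each with input weights, recurrent weights and a bias) and for a dense layer; the helper names are made up for illustration. Note that the number of time steps (200) does not appear in any formula.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units):
    # A bidirectional wrapper holds one forward and one backward LSTM.
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    # Kernel plus bias; the same kernel is reused at every time step.
    return input_dim * units + units


features = 35  # input features per time step, as in the question

counts = [
    bilstm_params(features, 64),  # outputs 2 * 64 = 128 features
    bilstm_params(128, 40),       # outputs 2 * 40 = 80 features
    dense_params(80, 40),
    dense_params(40, 10),
    dense_params(10, 1),
]
total = sum(counts)
print(counts, total)
```

Because the dense layers cost 3,240, 410 and 11 parameters regardless of sequence length, each one must be a single layer reused across all 200 time steps rather than 200 separate layers.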

The Bi-LSTM here is set to return_sequences=True, so it passes the full sequence of hidden states to the subsequent layer as a 3D tensor (batch, time, features). To apply a dense network at each time step of that sequence, you wrap it in TimeDistributed. (In recent Keras versions, a plain Dense layer given a 3D input also acts only on the last axis, so it produces the same result; TimeDistributed makes the intent explicit.)

As the output shape (None, 200, 40) suggests, the first TimeDistributed layer applies a 40-node dense layer to each of the 200 hidden states output by the preceding Bi-LSTM. A 10-node layer is then applied on top of each of those, giving (None, 200, 10), and the final 1-node layer follows the same logic.
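
One detail worth checking against the paper description in the question: Bidirectional concatenates the forward and backward outputs by default, so LSTM(64) yields 128 features per time step, as the summary shows. If the intent is 64 outputs (32 per direction) and 40 outputs (20 per direction), the per-direction unit counts would be 32 and 20. A minimal sketch under that reading, assuming TensorFlow 2.x Keras and the (200, 35) input shape from the question:

```python
import tensorflow as tf
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Bidirectional, Dense, LSTM, TimeDistributed

model = Sequential([
    Input(shape=(200, 35)),                          # (time steps, features)
    Bidirectional(LSTM(32, return_sequences=True)),  # 32 + 32 = 64 outputs
    Bidirectional(LSTM(20, return_sequences=True)),  # 20 + 20 = 40 outputs
    TimeDistributed(Dense(40, activation="tanh")),
    TimeDistributed(Dense(10, activation="tanh")),
    TimeDistributed(Dense(1, activation="sigmoid")),
])

# Run a dummy batch of two sequences to confirm one score per time step.
scores = model(tf.zeros((2, 200, 35)))
out_shape = tuple(scores.shape)
model.summary()
```

The rest of the architecture is unchanged; only the LSTM unit counts differ from the code in the question.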

If your doubt is about what TimeDistributed layers are, the official documentation says:

This wrapper allows applying a layer to every temporal slice of an input.

The final goal is speaker change detection, meaning that you want to predict the probability of a speaker change at each of the 200 time steps. Therefore the output layer returns 200 sigmoid scores between 0 and 1, with shape (None, 200, 1).

Hope that solves your confusion.

Another intuitive way of looking at it:

Your Bi-LSTM is set to return sequences instead of just the final features. Each time step in the returned sequence needs to pass through the dense network. TimeDistributed Dense takes the input sequence and applies the same dense layer, with the same weights, to every time step. So the layer still has only one set of 40 nodes (80 x 40 weights plus 40 biases, the 3,240 parameters in the summary), but it is evaluated 200 times and produces 200 x 40 outputs, where the 3rd group of 40 outputs is computed from the 3rd time step of the Bi-LSTM. This gives an MLP whose weights are shared across the Bi-LSTM sequence.
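
To make the weight sharing concrete, here is a small NumPy sketch (an illustration of the idea, not Keras internals). It applies one kernel and one bias to every time step in a loop, then to the whole sequence at once, and checks that both give the same result. The shapes mirror the first dense layer above (80 inputs, 40 outputs); the random data is an assumption for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, in_dim, out_dim = 2, 200, 80, 40

x = rng.standard_normal((batch, timesteps, in_dim))  # stand-in for Bi-LSTM output
W = rng.standard_normal((in_dim, out_dim))           # one kernel for all time steps
b = rng.standard_normal(out_dim)                     # one bias for all time steps

# Apply the same dense layer to each time step separately, then restack.
per_step = np.stack(
    [np.tanh(x[:, t, :] @ W + b) for t in range(timesteps)], axis=1
)

# Apply it to the whole sequence at once; matmul broadcasts over the time axis.
all_at_once = np.tanh(x @ W + b)

same = np.allclose(per_step, all_at_once)
n_params = W.size + b.size
print(per_step.shape, same, n_params)
```

The parameter count depends only on the input and output widths, which is why it stays the same however long the sequence is.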

A good visual intuition that I prefer when working with LSTMs -

If you DON'T return sequences, the output of the LSTM is just the final hidden state h_t (LHS of the image below)
If you return sequences, the output is the whole sequence (h_0 to h_t) (RHS of the image below)

Adding a Dense layer in the first case will only take in h_t as input. In the second case, you use a TimeDistributed Dense, which "stacks" on top of each of h_0 to h_t.
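
The difference between the two cases shows up directly in the tensor shapes. A minimal sketch, assuming TensorFlow 2.x Keras; the sizes (batch of 2, 5 time steps, 3 features, 8 LSTM units, 4 dense nodes) are arbitrary values chosen for illustration:

```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

x = tf.zeros((2, 5, 3))  # (batch, time steps, features)

# Case 1: only the final hidden state is returned.
last_only = LSTM(8)(x)
head_last = Dense(4)(last_only)

# Case 2: one hidden state per time step is returned.
full_seq = LSTM(8, return_sequences=True)(x)
head_seq = TimeDistributed(Dense(4))(full_seq)

shapes = [tuple(t.shape) for t in (last_only, head_last, full_seq, head_seq)]
print(shapes)
```

In the first case the time axis is gone after the LSTM, so the dense layer produces one output vector per sequence. In the second case the time axis is kept, and the dense layer produces one output vector per time step.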