Below are the code and outcomes. There are two models: the second wraps the GRU in Bidirectional. My question is: why is the parameter count of time_distributed_14 (264) not double that of time_distributed_13 (136)? I know 264 = 136 * 2 - 8, but why do we need the -8 here?
from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, Bidirectional, GRU
InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model2 = Sequential()
model2.add(Bidirectional(GRU(HiddenSize, return_sequences=True), input_shape=(MaxLen, InputSize)))
model2.add(TimeDistributed(Dense(OutputSize)))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')
print(model1.summary())
print(model2.summary())
Outcome:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
gru_9 (GRU) (None, 64, 16) 1536
_________________________________________________________________
time_distributed_13 (TimeDis (None, 64, 8) 136
_________________________________________________________________
activation_6 (Activation) (None, 64, 8) 0
=================================================================
Total params: 1,672
Trainable params: 1,672
Non-trainable params: 0
_________________________________________________________________
None
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional_7 (Bidirection (None, 64, 32) 3072
_________________________________________________________________
time_distributed_14 (TimeDis (None, 64, 8) 264
_________________________________________________________________
activation_7 (Activation) (None, 64, 8) 0
=================================================================
Total params: 3,336
Trainable params: 3,336
Non-trainable params: 0
_________________________________________________________________
None
There aren't only "weights", there are "biases" too, and biases completely ignore the inputs.
weights = input * output
- regular: 16 * 8 = 128
- bidirectional: 32 * 8 = 256

biases = output
- regular: 8
- bidirectional: 8

parameters = weights + biases
- regular: 128 + 8 = 136
- bidirectional: 256 + 8 = 264

Doubling 136 would also double the 8 biases, but the Dense layer still has only one bias per output unit no matter how many inputs it receives — hence the -8.
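The arithmetic above can be checked without Keras at all — a minimal sketch, assuming a Dense layer's parameter count is simply its weight matrix plus one bias per output unit (the helper name `dense_params` is mine, not from any library):

```python
def dense_params(n_in, n_out):
    """Parameter count of a Dense layer: (n_in x n_out) weight matrix + n_out biases."""
    return n_in * n_out + n_out

# Regular model: the Dense layer sees HiddenSize = 16 features per timestep.
regular = dense_params(16, 8)

# Bidirectional model: forward and backward GRU outputs are concatenated,
# so the Dense layer sees 2 * 16 = 32 features per timestep.
bidirec = dense_params(32, 8)

print(regular)               # 136
print(bidirec)               # 264
print(2 * regular - bidirec) # 8 — only the weights double; the bias vector does not
```

TimeDistributed does not change these counts: the same Dense weights are reused at every one of the 64 timesteps.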