Tags: python, keras, lstm, embedding, language-model

Understanding Character Level Embedding in Keras LSTM


I am new to implementing language models with RNN structures in Keras. I have a dataset of standalone words (not taken from a single paragraph) with the following statistics:

  1. Total word samples: 1953
  2. Total number of distinct characters: 33 (including START, END, and *)
  3. Maximum length (number of characters) of a word: 10

Now, I want to build a model that accepts a character and predicts the next character in the word. I have padded all the words to the same length, so my input Word_input has shape 1953 x 9 and the target has shape 1953 x 9 x 33. I also want to use an Embedding layer, so my network architecture is:

    self.wordmodel=Sequential()
    self.wordmodel.add(Embedding(33,embedding_size,input_length=9))
    self.wordmodel.add(LSTM(128, return_sequences=True))
    self.wordmodel.add(TimeDistributed(Dense(33)))
    self.wordmodel.compile(loss='mse',optimizer='rmsprop',metrics=['accuracy'])

As an example, the word "CAT" with padding is represented as

Input to the network -- START C A T END * * * * (9 characters)

Target for the same --- C A T END * * * * * (9 characters)

So with the TimeDistributed output I am measuring the difference between the network's prediction and the target. I have also set the batch_size to 1, so that the network resets its state after reading each sample word.
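For reference, a minimal sketch of how such arrays could be built (the character set and char_to_idx mapping below are illustrative, not my exact 33-character vocabulary):

    import numpy as np

    # Illustrative vocabulary: pad (*), START, END and the letters A-Z.
    chars = ['*', 'START', 'END'] + [chr(c) for c in range(ord('A'), ord('Z') + 1)]
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    vocab_size = len(chars)            # 29 here; 33 in my actual data
    max_len = 9                        # input/target characters per word

    def encode_word(word):
        """Return (input indices, one-hot next-character targets) for one word."""
        seq = ['START'] + list(word) + ['END']
        seq += ['*'] * (max_len + 1 - len(seq))    # pad to max_len + 1 symbols
        idx = [char_to_idx[ch] for ch in seq]
        x = np.array(idx[:-1])                     # e.g. START C A T END * * * *
        y = np.zeros((max_len, vocab_size))
        for t, nxt in enumerate(idx[1:]):          # e.g. C A T END * * * * *
            y[t, nxt] = 1.0
        return x, y

    x, y = encode_word('CAT')
    print(x.shape, y.shape)                        # (9,) and (9, 29)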

My question is: am I doing this conceptually right? Whenever I run training, the accuracy gets stuck at about 56%.

Kindly enlighten me. Thanks.


Solution

  • To my knowledge, the structure is basic and may work to some degree. I have some suggestions:

    1. In the TimeDistributed layer, you should add a softmax activation, which is widely used for multi-class classification. As it stands, your output is unbounded, which doesn't match your one-hot targets.

    2. With the softmax, you could change the loss to cross-entropy, which increases the probability of the correct class and decreases the others. That is more appropriate here.

    You can give it a try; the sketch below shows both changes applied to your model. For a more capable model, you could try the structure given in the PyTorch tutorial. Thanks.
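    For instance (a sketch reusing your shapes; embedding_size here is just an illustrative value):

        from keras.models import Sequential
        from keras.layers import Embedding, LSTM, TimeDistributed, Dense

        vocab_size = 33
        embedding_size = 64            # illustrative value

        model = Sequential()
        model.add(Embedding(vocab_size, embedding_size, input_length=9))
        model.add(LSTM(128, return_sequences=True))
        # softmax turns each timestep's 33 scores into a probability distribution
        model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))
        # categorical cross-entropy matches one-hot targets of shape (1953, 9, 33)
        model.compile(loss='categorical_crossentropy',
                      optimizer='rmsprop',
                      metrics=['accuracy'])

    Note that categorical_crossentropy expects the one-hot targets you already have; with integer-encoded targets, sparse_categorical_crossentropy would be the equivalent choice.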

    [Figure: model structure from the PyTorch tutorial]