I'm trying to predict the next word with a recurrent neural network.
I'm training the network by feeding the independently pre-trained word2vec
vectors of the input words.
And I wonder whether I can use the word2vec vector
of the target word to compute the error cost.
It doesn't seem to work, and I've never seen such examples or papers.
Is it possible to use word2vec as a target value for calculating error cost?
If so, what kind of cost function should I use?
If not, please explain the reason mathematically.
And how should I set up the input and target? Right now I'm using an architecture like the one below:
input : word1, word2, word3, target : word4
input : word1, word2, word3, word4, target : word5
Maybe I could use another option like:
input : word1, word2, target : word2, word3
input : word1, word2, word3, target : word2, word3, word4
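To make the two setups concrete, here is a small sketch of how each option would build (input, target) training pairs from one tokenized sequence (the variable names are just illustrative):

```python
# Hypothetical sketch of the two training-pair setups described above.
words = ["word1", "word2", "word3", "word4", "word5"]

# Option 1: growing input context, single next word as target.
option1 = [(words[:i], words[i]) for i in range(3, len(words))]
# e.g. (["word1", "word2", "word3"], "word4")

# Option 2: input sequence paired with the same sequence shifted by one,
# so every time step has a target.
option2 = [(words[:i], words[1:i + 1]) for i in range(2, len(words))]
# e.g. (["word1", "word2"], ["word2", "word3"])
```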
Which one is better? Or is there another option?
If there's any reference, please let me know.
The prediction is usually made through an output softmax layer that gives the probabilities for all words in the vocabulary.
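As a minimal sketch of that output layer (numpy, with made-up shapes): the final hidden state is projected to vocabulary-sized logits, softmax turns them into a distribution over all words, and the loss is the cross-entropy of the true next word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4          # illustrative sizes

h = rng.standard_normal(hidden_size)     # last RNN hidden state (stand-in)
W_out = rng.standard_normal((vocab_size, hidden_size))
b_out = np.zeros(vocab_size)

logits = W_out @ h + b_out               # one score per vocabulary word
probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()                     # probabilities sum to 1

target = 3                               # index of the true next word
loss = -np.log(probs[target])            # cross-entropy for this step
```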
However, a recent paper suggests tying the input word vectors to the output word classifiers and training them end-to-end, which significantly reduces the number of parameters: https://arxiv.org/abs/1611.01462
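A rough sketch of that tying idea (illustrative only, not the paper's exact formulation): the same embedding matrix is used both to look up the input word vector and as the output classifier, so no separate output projection is needed for those parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, emb_size = 10, 4                      # illustrative sizes
E = rng.standard_normal((vocab_size, emb_size))   # shared embedding matrix

x = E[2]            # input side: embedding lookup for word index 2
h = np.tanh(x)      # stand-in for the recurrent transition
logits = E @ h      # output side: the tied matrix acts as the classifier
```

Note that tying requires the hidden size to match the embedding size (or an extra projection between them).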
Regarding architectures, at least for training I would prefer the second option, since the first one loses information about the second and third words that could also be used for training: the shifted-target setup gives the network a prediction loss at every time step instead of only at the end of the sequence.