I am trying to gauge the impact of part-of-speech (POS) information on Word2Vec embeddings, but I am not getting the results I expected.
I expected the POS-augmented Word2Vec embeddings to perform better in a machine translation task, but they actually perform slightly worse.
I am training two sets of embeddings on the same corpus with Gensim: one is plain Word2Vec; for the other, I change each token to "[WORD]__[POS]".
I am gauging the difference in performance by using each set of embeddings in a Seq2Seq machine translation task and evaluating the two approaches with BLEU.
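For reference, the comparison is of this form (NLTK's corpus_bleu is shown purely as an illustration; the exact BLEU implementation and tokenization I use may differ):

from nltk.translate.bleu_score import corpus_bleu

# references: for each test sentence, a list of one or more tokenized reference translations
# hypotheses: the model's tokenized output for each test sentence
references = [[["this", "is", "a", "test"]], [["another", "example"]]]
hypotheses = [["this", "is", "test"], ["another", "example"]]

# corpus-level BLEU aggregates n-gram counts over the whole test set;
# this is the number I compare between the plain and POS-tagged runs
print("BLEU: {:.4f}".format(corpus_bleu(references, hypotheses)))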
This is how I build the POS-tagged training data with spaCy:
sentences = []
for sent in doc.sents:
    tokens = []
    for t in sent:
        # tag each token as "word__POS", e.g. "shop__NOUN"
        tokens += ["{}__{}".format(t.text, t.pos_)]
    # all tagged tokens of this document end up in one flat list
    sentences += tokens
# pos_train collects one flat token list per document
pos_train += [sentences]
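The token lists then go into Gensim roughly like this (the hyperparameters here are illustrative, not the exact values I use):

from gensim.models import Word2Vec

# pos_train is a list of token lists where each token is "word__POS";
# the plain model is trained the same way on untagged tokens.
# (In Gensim < 4.0, vector_size is called size and epochs is called iter.)
w2v_pos = Word2Vec(
    sentences=pos_train,
    vector_size=300,  # illustrative dimensionality
    window=5,
    min_count=5,
    epochs=5,
)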
This is my benchmark machine translation model with Keras + TensorFlow:

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Encoder: reads the source sequence and keeps only its final LSTM states
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(LATENT_DIM, return_state=True)
_, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: generates the target sequence, conditioned on the encoder states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
On BLEU, the Word2Vec + POS approach consistently scores either the same as, or 0.01-0.02 points below, the plain Word2Vec embeddings.
Does anyone know why this might be happening? Is there a gap in my reasoning or expectations?
I, too, would have expected accurate part-of-speech info to improve translation – but I don't know if others have reported such an improvement. Some (uninformed) conjectures as to why it might not:
- maybe the POS tagging isn't very accurate for one of the languages, or there are other anomalous challenges specific to your data
- maybe the method of creating composite tokens, with the internal __, interferes with evaluation in a few corner cases – for example, if the original corpus retains any tokens which already had __ in them (a quick check for this, and for the homograph point below, is sketched after this list)
- maybe for certain cases of insufficient data, the collision of similar-meaning homographs of different parts of speech actually helps thicken the vaguer meaning-to-meaning translation. (For example, maybe given the semantic relatedness of shop_NOUN and shop_VERB, it's better to have 100 colliding examples of shop than 50 of each.)
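As a quick sanity check on the last two points, something like this would show whether any raw tokens already contain __, and how much the POS split thins out frequent homographs (plain_tokens and pos_tokens are placeholder names for your flattened corpora):

from collections import Counter

# plain_tokens: flat list of tokens from the untagged corpus (placeholder)
# pos_tokens:   flat list of "word__POS" tokens from the tagged corpus (placeholder)

# 1) raw tokens that already contained "__" before tagging; these could
#    collide with the composite "word__POS" tokens
suspicious = [t for t in set(plain_tokens) if "__" in t]
print("raw tokens containing '__':", suspicious[:20])

# 2) how often a single surface form got split across several POS variants
plain_counts = Counter(plain_tokens)
pos_counts = Counter(pos_tokens)
variants = Counter(t.rsplit("__", 1)[0] for t in pos_counts)
split = sorted(((plain_counts[w], w, n) for w, n in variants.items() if n > 1), reverse=True)
print("most frequent words split across multiple POS tags:")
for freq, word, n_variants in split[:20]:
    print("  {:<15} {} variants, {} total occurrences".format(word, n_variants, freq))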
Some debugging ideas (in addition to the obvious "double-check everything"):
- look closely at exactly those test cases where the plain-vs-POS approaches differ in their scoring; see if there are any patterns – like strange tokens/punctuation, nonstandard grammar, etc. – giving clues to where the __POS decorations hurt
- try other language pairs, and other (private or public) parallel corpora, to see if elsewhere (or in general) the POS tagging does help, and whether there's something extra-challenging about your particular dataset/language pair
- consider that the multiplication of tokens (by splitting homographs into POS-specific variants) has changed the model's vocabulary size and word distributions, in ways that might interact with other limits (like min_count, max_vocab_size, etc.) and thereby modify training. In particular, perhaps the larger-vocabulary POS model should get more training epochs, or a larger word-vector dimensionality, to reflect its larger vocabulary with a lower average number of word occurrences (a rough way to quantify this is sketched below)
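For the last point, a rough comparison of the two vocabularies could look like this (the corpus variable names and the scaled hyperparameters are only illustrative):

from collections import Counter

# plain_corpus / pos_corpus: lists of token lists, as fed to Word2Vec (placeholder names)
def vocab_stats(corpus, min_count=5):
    counts = Counter(t for sent in corpus for t in sent)
    kept = sum(1 for c in counts.values() if c >= min_count)
    total = sum(counts.values())
    print("types: {:>8}  kept at min_count={}: {:>8}  avg occurrences/type: {:.1f}".format(
        len(counts), min_count, kept, total / len(counts)))
    return counts

plain_counts = vocab_stats(plain_corpus)
pos_counts = vocab_stats(pos_corpus)

# If the POS vocabulary is substantially larger, each token type is seen less often
# on average, so the POS model may deserve proportionally more epochs, or a
# different min_count / vector_size (illustrative scaling only):
scale = len(pos_counts) / len(plain_counts)
pos_epochs = int(round(5 * scale))  # e.g. if the plain model trains for 5 epochs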
Good luck!