Tags: keras, nlp, word2vec, word-embedding, seq2seq

Word2Vec with POS not producing expected results?


I am trying to gauge the impact of including part-of-speech information in Word2Vec embeddings, but I am not getting the results I expected.

I expected the POS-augmented Word2Vec embeddings to perform better in a machine translation task, but they actually perform worse.

I am creating two sets of embeddings from the same corpus using Gensim: one with normal Word2Vec tokens, and one where each token is changed to "[WORD]__[POS]".

I am gauging the difference in performance by using each set of embeddings in a Seq2Seq machine translation task and evaluating the two approaches with BLEU.

This is how I build the training sentences for the Word2Vec + POS embeddings with spaCy:

pos_train = []
for doc in docs:                     # assumed outer loop over spaCy Doc objects
    sentences = []
    for sent in doc.sents:
        tokens = []
        for t in sent:
            tokens += ["{}__{}".format(t.text, t.pos_)]
        sentences += tokens          # flattens all of the doc's tokens into one list
    pos_train += [sentences]         # so pos_train holds one token list per document
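
Both sets of embeddings are then trained with Gensim in the usual way; roughly like this (simplified sketch: plain_train is the same corpus tokenized without the __POS suffix, and the hyperparameters shown are only illustrative):

from gensim.models import Word2Vec

# Train one model per corpus variant (vector_size is `size` in Gensim < 4.0).
w2v_plain = Word2Vec(sentences=plain_train, vector_size=300,
                     window=5, min_count=5, workers=4)
w2v_pos = Word2Vec(sentences=pos_train, vector_size=300,
                   window=5, min_count=5, workers=4)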

This is my benchmark machine translation model with Keras + TensorFlow:

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# Encoder: keep only the final LSTM states as the context passed to the decoder.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(LATENT_DIM, return_state=True)
_, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: teacher-forced LSTM initialised with the encoder states,
# followed by a softmax over the target vocabulary.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
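
How the Gensim vectors are connected to this network is not shown above; one common pattern, given here only as an illustration (the token_index mapping, w2v_pos and the frozen Embedding layer are assumptions, not necessarily what is used), is to copy the trained vectors into an Embedding layer's weight matrix and feed the encoder integer token ids instead of one-hot vectors:

import numpy as np
from tensorflow.keras.layers import Embedding

# Build an embedding matrix from the trained Gensim vectors.
# token_index maps token -> integer id; index 0 is reserved for padding.
embedding_dim = w2v_pos.wv.vector_size
embedding_matrix = np.zeros((len(token_index) + 1, embedding_dim))
for token, idx in token_index.items():
    if token in w2v_pos.wv:          # skip out-of-vocabulary tokens
        embedding_matrix[idx] = w2v_pos.wv[token]

# Frozen layer; the encoder input then becomes Input(shape=(None,)) of token ids.
encoder_embedding = Embedding(input_dim=embedding_matrix.shape[0],
                              output_dim=embedding_dim,
                              weights=[embedding_matrix],
                              trainable=False)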

With BLEU, the Word2Vec + POS approach consistently scores either the same as, or 0.01-0.02 points below, the plain Word2Vec embeddings.
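
The comparison itself is a standard corpus-level BLEU; something along these lines (sketch only, with NLTK; references holds the tokenized reference translations and hyp_plain / hyp_pos the decoded outputs of the two models):

from nltk.translate.bleu_score import corpus_bleu

# references: for each test sentence, a list of reference translations (token lists);
# hyp_plain / hyp_pos: the decoded token lists from the two models.
bleu_plain = corpus_bleu(references, hyp_plain)
bleu_pos = corpus_bleu(references, hyp_pos)
print("plain Word2Vec BLEU:  {:.4f}".format(bleu_plain))
print("Word2Vec + POS BLEU:  {:.4f}".format(bleu_pos))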

Does anyone know why this might be happening? Is there a gap in my reasoning or expectations?


Solution

  • I, too, would have expected accurate part-of-speech info to improve translation – but I don't know if others have reported such an improvement. Some (uninformed) conjectures as to why it might not:

    • maybe the POS tagging isn't very accurate with regard to one of the languages, or there are some other anomalous challenges specific to your data

    • maybe the method of creating composite tokens, with the internal __, interferes with evaluation in a few corner cases – for example, if the original corpus contains any tokens that already had __ in them (a quick check for this is sketched below)

    • maybe when data is scarce, letting similar-meaning homographs of different parts of speech collide into one token actually strengthens the vaguer meaning-to-meaning translation. (For example, given the semantic relatedness of shop_NOUN and shop_VERB, it may be better to have 100 colliding examples of shop than 50 of each.)

    Some debugging ideas (in addition to the obvious "double-check everything"):

    • look closely at exactly those test cases where the plain and POS approaches score differently; see if there are any patterns – strange tokens/punctuation, nonstandard grammar, etc. – that give clues to where the __POS decorations hurt (see the per-sentence comparison sketched below)

    • try other language pairs, and other (private or public) parallel corpora, to see whether the POS tagging does help elsewhere (or in general), which would suggest there's something extra-challenging about your particular dataset/language pair

    • consider that the multiplication of tokens (by splitting homographs into POS-specific variants) has changed the vocabulary size & word distributions in ways that may interact with other limits (like min_count, max_vocab_size, etc.) and so alter training. In particular, perhaps the larger-vocabulary POS model needs more training epochs, or a larger word-vector dimensionality, to make up for its lower average number of occurrences per word (the comparison sketched below makes the size difference concrete).
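
    On the composite-token conjecture, a quick way to check whether the untagged corpus already contains tokens with __ in them (a sketch; plain_train is assumed to be the untagged counterpart of pos_train from the question):

        # Tokens that already contain "__" would collide with the WORD__POS scheme.
        suspicious = {tok for sentence in plain_train for tok in sentence if "__" in tok}
        print("{} distinct tokens already contain '__'".format(len(suspicious)))
        print(sorted(suspicious)[:20])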
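
    For the per-test-case comparison, sentence-level BLEU makes it easy to surface the examples where the two models diverge most (a sketch with NLTK; references, hyps_plain and hyps_pos are placeholders for the shared reference translations and the two models' decoded outputs):

        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

        smooth = SmoothingFunction().method1
        diffs = []
        for refs, hp, hq in zip(references, hyps_plain, hyps_pos):
            b_plain = sentence_bleu(refs, hp, smoothing_function=smooth)
            b_pos = sentence_bleu(refs, hq, smoothing_function=smooth)
            diffs.append((b_plain - b_pos, hp, hq))

        # Inspect the cases where the POS model lags the plain model the most.
        for delta, hp, hq in sorted(diffs, reverse=True)[:10]:
            print(delta, " ".join(hp), "|", " ".join(hq))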
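
    And for the vocabulary point, comparing the two trained Gensim models directly shows how much the POS splitting enlarges the vocabulary and thins out the per-token counts (a sketch; w2v_plain and w2v_pos are the two trained Word2Vec models, Gensim 4.x API):

        for name, model in [("plain", w2v_plain), ("pos", w2v_pos)]:
            vocab = model.wv.key_to_index                    # model.wv.vocab in Gensim 3.x
            total = sum(model.wv.get_vecattr(w, "count") for w in vocab)
            print("{}: vocab size = {}, avg occurrences per token = {:.1f}".format(
                name, len(vocab), total / len(vocab)))

    If the POS model's vocabulary is much larger with far fewer occurrences per token, that's a sign the min_count / epochs / vector_size settings deserve the re-tuning suggested above.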

    Good luck!