Tags: neural-network, nlp, word-embedding, allennlp, elmo

NLP ELMo model pruning input


I am trying to retrieve embeddings for words based on the pretrained ELMo model available on TensorFlow Hub. The code I am using is modified from here: https://www.geeksforgeeks.org/overview-of-word-embedding-using-embeddings-from-language-models-elmo/

The sentence that I am inputting is
bod =" is coming up in and every project is expected to do a video due on we look forward to discussing this with you at our meeting this this time they have laid out the selection criteria for the video award s go for the top spot this time "

and these are the keywords I want embeddings for:
words=["do", "a", "video"]

import tensorflow as tf
import tensorflow_hub as hub

# load the pretrained ELMo module from TensorFlow Hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2")

embeddings = elmo([bod],
                  signature="default",
                  as_dict=True)["elmo"]  # shape: [batch_size, max_tokens, 1024]
init = tf.global_variables_initializer()  # tf.initialize_all_variables() is deprecated
sess = tf.Session()
sess.run([init, tf.tables_initializer()])  # the hub module also needs its tables initialized

This sentence is 236 characters long, as shown in the attached screenshot of len(bod).

But when I feed this sentence into the ELMo model, the tensor that comes back only has 48 positions along its second dimension (see the attached screenshot of the tensor shape).
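For reference, both numbers can be checked directly with the session set up above (this is a sketch rather than the original screenshots; the 1024 in the last dimension is ELMo's standard output size):

print(len(bod))             # 236 characters
arr = sess.run(embeddings)  # evaluate the "elmo" output
print(arr.shape)            # (1, 48, 1024): only 48 positions along the second axis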

This becomes a problem when I try to extract embeddings for keywords that fall beyond that limit of 48, because the indices I compute for the keywords are larger than 48 (see the attached screenshot of the keyword indices).

This is the code I used to get the indices of the words in bod (as shown above):

num_list = []
for item in words:
    print(item)
    index = bod.index(item)  # character offset of the first occurrence of item in bod
    num_list.append(index)
print(num_list)

But I keep running into an error (see the attached screenshot of the traceback).

I tried looking for ELMo documentation that would explain why this is happening, but I have not found anything about this apparent pruning of the input.

Any advice is much appreciated!

Thank you


Solution

  • This is not really an AllenNLP issue, since you are using a TensorFlow-based implementation of ELMo.

    That said, I think the problem is that ELMo embeds tokens, not characters. You are getting 48 embeddings because the string has 48 tokens: 236 is a character count, and bod.index() likewise returns character offsets, which is why the indices do not line up with the token axis of the output. See the sketch below for looking the keywords up by token position instead.
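    As a minimal sketch of how to line the keywords up with those 48 embeddings (reusing bod, words, embeddings, and sess from the question, and assuming the Hub module tokenizes its string input by splitting on spaces, which, as far as I can tell, is what the tfhub.dev ELMo module does for its default signature):

    tokens = bod.split()                              # whitespace tokens; 48 of them for the sentence above
    token_indices = [tokens.index(w) for w in words]  # first occurrence of each keyword, as a token position
    vectors = sess.run(embeddings)                    # numpy array of shape (1, num_tokens, 1024)
    word_vectors = {w: vectors[0, i] for w, i in zip(words, token_indices)}
    print({w: v.shape for w, v in word_vectors.items()})  # each value is a 1024-dimensional vector

    Here len(bod.split()) is 48, matching the 48 embeddings you get back, while bod.index() returns character offsets into the 236-character string, which is why those indices can point past the end of the token axis.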