Tags: machine-learning, word2vec, word-embedding

Extract more meaningful words from publicly available word embedding


I have two publicly available sets of word embeddings, GloVe and Google's Word2vec. However, their vocabularies contain many misspelled or garbage words (e.g., ##AA##, adirty). To avoid these words, I would like to extract only the most frequent words (e.g., the top 50,000), since relatively frequent words tend to have normal forms.

So, I wonder whether there is a way to find word frequencies in these two pretrained embeddings. If not, I would like to know whether there are techniques to exclude such words.


Solution

  • The GoogleNews vector set does not contain frequency information, but it does appear to be sorted from most frequent to least frequent. So, if you change the code that loads it to read only the first N words, you should get the N most frequent words.

    (The Python gensim library for training or working with word vectors exposes this as the limit option on its load_word2vec_format() function.)

    GloVe may follow the same convention; a look over the order of words in the file should give a good idea.
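
    As a rough sketch of the limit option in practice: the snippet below writes a tiny word2vec-format file (the filename and toy words are made up for illustration), with words ordered most frequent first, just as the GoogleNews file appears to be. Loading with limit=2 then keeps only the two most frequent words and drops the garbage entries.

    ```python
    from gensim.models import KeyedVectors

    # Build a tiny word2vec-format text file: header is "<vocab_size> <dims>",
    # followed by one word and its vector per line, most frequent words first.
    with open("tiny_vectors.txt", "w") as f:
        f.write("4 3\n")
        f.write("the 0.1 0.2 0.3\n")
        f.write("of 0.4 0.5 0.6\n")
        f.write("adirty 0.7 0.8 0.9\n")
        f.write("##AA## 1.0 1.1 1.2\n")

    # limit=2 loads only the first two entries, i.e. the two most frequent
    # words, so the misspelled/garbage entries further down are skipped.
    kv = KeyedVectors.load_word2vec_format("tiny_vectors.txt", binary=False, limit=2)
    print(kv.index_to_key)  # ['the', 'of']
    ```

    For the real GoogleNews file you would pass binary=True and a larger limit (e.g., limit=50000) instead.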