Gensim Word2Vec select minor set of word vectors from pretrained model


I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the embedding is enormous and I don't need most of the word vectors, because I know which words can occur as input. So I want to drop the rest to reduce the size of my embedding layer.

Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?


Solution

  • The following is adapted from this answer, with a few small improvements to the code.

    The words to keep are in restricted_word_set (a list or a set both work, though a set gives faster lookups) and w2v is the model. Here is the function:

    import numpy as np
    
    def restrict_w2v(w2v, restricted_word_set):
        """Shrink w2v in place so it only contains words in restricted_word_set."""
        new_vectors = []
        new_vocab = {}
        new_index2entity = []
        new_vectors_norm = []
    
        # Walk the vocabulary in index order so kept words preserve their relative order.
        for i in range(len(w2v.vocab)):
            word = w2v.index2entity[i]
            vec = w2v.vectors[i]
            vocab = w2v.vocab[word]
            vec_norm = w2v.vectors_norm[i]  # fails if vectors_norm is still None
            if word in restricted_word_set:
                # Renumber the entry so its index matches its new row.
                vocab.index = len(new_index2entity)
                new_index2entity.append(word)
                new_vocab[word] = vocab
                new_vectors.append(vec)
                new_vectors_norm.append(vec_norm)
    
        # Overwrite every word-related attribute with its reduced version.
        w2v.vocab = new_vocab
        w2v.vectors = np.array(new_vectors)
        w2v.index2entity = np.array(new_index2entity)
        w2v.index2word = np.array(new_index2entity)
        w2v.vectors_norm = np.array(new_vectors_norm)
    

    WARNING: Right after the model is loaded, vectors_norm is None, so calling this function at that point raises an error. vectors_norm becomes a numpy.ndarray after the first similarity query, so call something like most_similar("cat") once before using the function.
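
For context, vectors_norm holds the L2-normalized rows of vectors, which is why it is only filled in once a similarity query needs it. A minimal NumPy sketch of that normalization, using toy data; the helper name `l2_normalize_rows` is hypothetical and not part of gensim, and the zero-row guard is an added assumption:

```python
import numpy as np

def l2_normalize_rows(vectors):
    """Return a copy of `vectors` with each row scaled to unit L2 length."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Assumption for this sketch: leave all-zero rows unchanged instead of dividing by zero.
    norms[norms == 0] = 1.0
    return vectors / norms

# Toy stand-in for w2v.vectors (two 2-dimensional word vectors).
toy_vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
toy_norm = l2_normalize_rows(toy_vectors)
```

With unit-length rows, the cosine similarity between two words reduces to a plain dot product of their rows.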

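    The function overwrites every word-related attribute of the Word2VecKeyedVectors object (vocab, vectors, index2entity, index2word and vectors_norm). These are the gensim 3.x attribute names; gensim 4 renamed them.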

    Usage:

    from gensim.models import KeyedVectors
    
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
    w2v.most_similar("beer")
    

    [('beers', 0.8409687876701355),
    ('lager', 0.7733745574951172),
    ('Beer', 0.71753990650177),
    ('drinks', 0.668931245803833),
    ('lagers', 0.6570086479187012),
    ('Yuengling_Lager', 0.655455470085144),
    ('microbrew', 0.6534324884414673),
    ('Brooklyn_Lager', 0.6501551866531372),
    ('suds', 0.6497018337249756),
    ('brewed_beer', 0.6490240097045898)]

    restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
    restrict_w2v(w2v, restricted_word_set)
    w2v.most_similar("beer")
    

    [('lagers', 0.6570085287094116),
    ('wine', 0.6217695474624634),
    ('bash', 0.20583480596542358),
    ('computer', 0.06677375733852386),
    ('python', 0.005948573350906372)]
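
To tie this back to the embedding layer in the question: after restriction, row i of the vector matrix belongs to the word at position i of the index list, so a word-to-row mapping and a weight matrix can be read off directly. A minimal sketch on plain NumPy data, with toy values standing in for w2v.index2word and w2v.vectors. The helper `build_embedding_inputs` and the choice to reserve row 0 for padding are assumptions, not something the answer prescribes:

```python
import numpy as np

def build_embedding_inputs(index2word, vectors, reserve_padding=True):
    """Return (word_to_index, matrix) suitable as lookup table and initial weights."""
    offset = 1 if reserve_padding else 0
    # Each word maps to the row that will hold its vector in the returned matrix.
    word_to_index = {word: i + offset for i, word in enumerate(index2word)}
    vectors = np.asarray(vectors, dtype=np.float32)
    if reserve_padding:
        # Assumption: row 0 is an all-zero vector reserved for padding tokens.
        pad = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
        matrix = np.vstack([pad, vectors])
    else:
        matrix = vectors.copy()
    return word_to_index, matrix

# Toy stand-ins for the restricted w2v.index2word and w2v.vectors.
toy_index2word = ["beer", "wine", "python"]
toy_vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

word_to_index, embedding_matrix = build_embedding_inputs(toy_index2word, toy_vectors)
token_ids = [word_to_index[w] for w in ["wine", "beer"]]
```

The matrix could then serve as the initial weights of an embedding layer whose input dimension is embedding_matrix.shape[0]; how to pass it in depends on the Keras version.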

    The function can also be used to remove specific words.
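
One way to do that removal is to build the whitelist as the full vocabulary minus the unwanted words and pass the result to restrict_w2v. A small sketch of the set arithmetic, using a toy vocabulary in place of w2v.vocab (the helper name `whitelist_without` is hypothetical):

```python
def whitelist_without(all_words, words_to_remove):
    """Return the set of words to keep: everything except `words_to_remove`."""
    return set(all_words) - set(words_to_remove)

# Toy stand-in for the keys of w2v.vocab.
toy_vocab = ["beer", "wine", "computer", "python", "bash", "lagers"]
keep = whitelist_without(toy_vocab, {"bash", "python"})
# With a real model this would be followed by: restrict_w2v(w2v, keep)
```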