python, tensorflow, keras, word2vec, word-embedding

How do I create a Keras Embedding layer from a pre-trained word embedding dataset?


How do I load a pre-trained word-embedding into a Keras Embedding layer?

I downloaded glove.6B.50d.txt (from the glove.6B.zip file at https://nlp.stanford.edu/projects/glove/), but I'm not sure how to load it into a Keras Embedding layer. See: https://keras.io/layers/embeddings/


Solution

  • You will need to pass an embeddingMatrix to the Embedding layer as follows:

    Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

    • vocabLen: number of tokens in your vocabulary
    • embDim: dimension of the embedding vectors (50 in your case)
    • embeddingMatrix: embedding matrix built from glove.6B.50d.txt
    • isTrainable: whether you want the embeddings to be trainable or to freeze the layer (a toy sketch of such a layer follows this list)
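
    For illustration, here is a toy sketch of what such a layer does with a batch of word indices (random numbers stand in for the real embeddingMatrix, and it assumes a Keras version that accepts the weights constructor argument, as used throughout this answer):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding

    vocabLen, embDim = 10, 50                           # 9 real tokens + index 0 reserved for masking
    embeddingMatrix = np.random.rand(vocabLen, embDim)  # placeholder for the real Glove matrix

    model = Sequential()
    model.add(Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=False))

    wordIndices = np.array([[1, 2, 3]])                 # one sequence of 3 word indices
    print(model.predict(wordIndices).shape)             # (1, 3, 50): one 50-d vector per index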

    Each line of glove.6B.50d.txt holds whitespace-separated values: the word token followed by its (50) embedding values, e.g. the 0.418 0.24968 -0.41242 ...

    To create a pretrainedEmbeddingLayer from a Glove file:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding

    # Prepare Glove File
    def readGloveFile(gloveFile):
        with open(gloveFile, 'r', encoding='utf-8') as f:
            wordToGlove = {}  # map from a token (word) to a Glove embedding vector
            wordToIndex = {}  # map from a token to an index
            indexToWord = {}  # map from an index to a token 
    
            for line in f:
                record = line.strip().split()
                token = record[0] # take the token (word) from the text line
                wordToGlove[token] = np.array(record[1:], dtype=np.float64) # associate the Glove embedding vector with that token (word)
    
            tokens = sorted(wordToGlove.keys())
            for idx, tok in enumerate(tokens):
                kerasIdx = idx + 1  # index 0 is reserved for masking/padding in Keras
                wordToIndex[tok] = kerasIdx # map the token (word) to its index
                indexToWord[kerasIdx] = tok # map the index back to its token (word). Note: inverse of the dictionary above
    
        return wordToIndex, indexToWord, wordToGlove
    
    # Create Pretrained Keras Embedding Layer
    def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
        vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
        embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimensions (e.g. 50)
    
        embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
        for word, index in wordToIndex.items():
            embeddingMatrix[index, :] = wordToGlove[word] # create embedding: word index to Glove word embedding
    
        embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
        return embeddingLayer
    
    # usage
    wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
    pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
    model = Sequential()
    model.add(pretrainedEmbeddingLayer)
    ...
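
    Once the layer is in place, wordToIndex is what turns raw text into the integer sequences fed to the model. A minimal sketch of that step (the sentences are made up; unknown words fall back to index 0, the same index reserved for masking/padding above, and pad_sequences is assumed to be importable from keras.preprocessing as in Keras 2.x):

    from keras.preprocessing.sequence import pad_sequences

    sentences = [["the", "cat", "sat"], ["hello", "world"]]              # hypothetical input
    sequences = [[wordToIndex.get(w, 0) for w in s] for s in sentences]  # words -> Keras indices
    X = pad_sequences(sequences, maxlen=10, padding="post")              # shape: (2, 10)
    # X can now be passed to model.fit(...) / model.predict(...) once the rest of the model is defined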