python · machine-learning · nlp · word-embedding

Use word2vec on tokenized sentences


I am trying to create an emotion recognition model using an SVM. I have a large dataset of sentences, each labeled with an emotion. After text pre-processing, I have a pandas data frame containing the tokenized sentences, as can be seen in [1.] Dataframe after pre-processing.

My objective is to turn all these tokenized sentences into word embeddings so that I can train models such as an SVM. The problem is how to use this data frame as input to word2vec or any other word embedding model.


Solution

  • You need one vector per input instance if you want to use an SVM. This means that you need to get embeddings for the words and apply some operation, typically pooling, that shrinks the sequence of word embeddings into a single vector.

    The most frequently used methods are mean-pooling and max-pooling, which simply take the element-wise average or maximum of the embeddings.

    Assuming your pandas data frame is in a variable data and you have the word embeddings in a dictionary embedding_table with string keys and NumPy array values, you can do something like this (mean-pooling), assuming that at least one word in each sentence is covered by the embeddings:

    import numpy as np

    def embed(word_sequence):
        # Collect the embedding vectors of all words covered by the table.
        embeddings = []
        for word in word_sequence:
            if word in embedding_table:
                embeddings.append(embedding_table[word])
        # Mean-pooling; replace np.mean with np.max for max-pooling.
        return np.mean(embeddings, axis=0)
    
    data["vector"] = data.Utterance.map(embed)