Tags: python, tensorflow, machine-learning, scikit-learn, word-embedding

How to use word embeddings plus an extra feature for text classification


I have a bunch of sentences that I am trying to classify. For each sentence, I generated word embeddings using word2vec. I also performed a cluster analysis, which grouped the sentences into 3 separate clusters.

What I want to do is use the cluster id (1-3) as a feature for my model. However, I am not entirely sure how to do this, and I can't seem to find a good article that clearly explains it.

I was thinking I could create a one-hot encoding for the cluster id and then somehow combine it with the word embedding, but I am really not sure what to do here.

I already have a model that will take the word embedding and classify the sentence:

from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

X = Data['word_embedding'].values
y = Data['category'].values

indices = Data.index.values
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    X, y, indices, test_size=0.3, random_state=428)

clf = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')

DSVM = clf.fit(X_train, y_train)
prediction = DSVM.predict(X_test)

print(metrics.classification_report(y_test, prediction))

Here X is the word embedding and y is the category; I am just not sure how to add the cluster id in as a feature.


Solution

  • Assuming you want to use TensorFlow, you can either one-hot encode the ids or map them to n-dimensional (randomly initialized, trainable) vectors using an Embedding layer. Here is an example with an Embedding layer, where I map each id to a 10-dimensional vector and then repeat this vector 50 times to match the max length of a sentence (so, for a given input, every word gets the same 10-dimensional vector). Afterwards, I just concatenate:

    import tensorflow as tf
    
    word_embedding_dim = 300
    max_sentence_length = 50
    
    word_embedding_input = tf.keras.layers.Input((max_sentence_length, word_embedding_dim))
    
    id_input = tf.keras.layers.Input((1,))
    # input_dim must cover the largest id, so use 4 for ids 1-3 (index 0 stays unused)
    embedding_layer = tf.keras.layers.Embedding(4, 10) # or one-hot encode
    x = embedding_layer(id_input)
    # drop the length-1 axis, then repeat the id vector once per word position
    x = tf.keras.layers.RepeatVector(max_sentence_length)(x[:, 0, :])
    
    output = tf.keras.layers.Concatenate()([word_embedding_input, x])
    model = tf.keras.Model([word_embedding_input, id_input], output)
    
    print(model.summary())
    
    Model: "model_1"
    __________________________________________________________________________________________________
     Layer (type)                   Output Shape         Param #     Connected to                     
    ==================================================================================================
     input_17 (InputLayer)          [(None, 1)]          0           []                               
                                                                                                      
     embedding_3 (Embedding)        (None, 1, 10)        40          ['input_17[0][0]']               
                                                                                                      
     tf.__operators__.getitem (Slic  (None, 10)          0           ['embedding_3[0][0]']            
     ingOpLambda)                                                                                     
                                                                                                      
     input_16 (InputLayer)          [(None, 50, 300)]    0           []                               
                                                                                                      
     repeat_vector_1 (RepeatVector)  (None, 50, 10)      0           ['tf.__operators__.getitem[0][0]'
                                                                     ]                                
                                                                                                      
     concatenate (Concatenate)      (None, 50, 310)      0           ['input_16[0][0]',               
                                                                      'repeat_vector_1[0][0]']        
                                                                                                      
    ==================================================================================================
    Total params: 40
    Trainable params: 40
    Non-trainable params: 0
    __________________________________________________________________________________________________
    None
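
    To sanity-check the shapes, you can push a batch of dummy inputs through the model (a quick sketch; the arrays are just random placeholders):

    import numpy as np
    
    dummy_words = np.random.random((2, max_sentence_length, word_embedding_dim)).astype("float32")
    dummy_ids = np.array([[1], [3]])  # cluster ids in 1-3
    out = model([dummy_words, dummy_ids])
    print(out.shape)  # (2, 50, 310)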
    

    If your inputs are not 2D sequences of word vectors but single sentence embeddings, it is even easier:

    import tensorflow as tf
    
    sentence_embedding_dim = 300
    
    sentence_embedding_input = tf.keras.layers.Input((sentence_embedding_dim,))
    id_input = tf.keras.layers.Input((1,))
    embedding_layer = tf.keras.layers.Embedding(4, 10) # input_dim=4 covers ids 1-3; or one-hot encode
    x = embedding_layer(id_input)
    
    # drop the length-1 axis so the id vector concatenates directly onto the sentence embedding
    output = tf.keras.layers.Concatenate()([sentence_embedding_input, x[:, 0, :]])
    model = tf.keras.Model([sentence_embedding_input, id_input], output)
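
    The "or one-hot encode" comment can be made concrete with a CategoryEncoding layer in place of the trainable Embedding. This is a sketch assuming a recent TensorFlow version (one that ships tf.keras.layers.CategoryEncoding) and integer ids in 0-3:

    import tensorflow as tf
    
    sentence_embedding_input = tf.keras.layers.Input((300,))
    id_input = tf.keras.layers.Input((1,), dtype="int32")
    
    # fixed one-hot vectors instead of learned id embeddings; num_tokens = max id + 1
    one_hot = tf.keras.layers.CategoryEncoding(num_tokens=4, output_mode="one_hot")(id_input)
    output = tf.keras.layers.Concatenate()([sentence_embedding_input, one_hot])
    model = tf.keras.Model([sentence_embedding_input, id_input], output)  # output shape (None, 304)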
    

    Here is a solution with numpy and sklearn for reference:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    
    samples = 10
    word_embedding_dim = 300
    max_sentence_length = 50
    
    # cluster ids in {1, 2, 3}, one per sample, shaped as a column for the encoder
    ids = np.random.randint(low=1, high=4, size=(samples,)).reshape(-1, 1)
    enc = OneHotEncoder(handle_unknown='ignore')
    ids = enc.fit_transform(ids).toarray()[:, None, :]  # (samples, 1, 3)
    
    X_train = np.random.random((samples, max_sentence_length, word_embedding_dim))
    
    # repeat each one-hot id along the word axis and append it to every word vector
    ids = np.repeat(ids, max_sentence_length, axis=1)
    X_train = np.concatenate([X_train, ids], axis=-1)
    print(X_train.shape)
    # (10, 50, 303)
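
    To feed the cluster id into the scikit-learn SVM from the question, which expects one flat feature vector per sentence, the same idea reduces to stacking the one-hot columns onto a sentence-level embedding. A sketch with random placeholder data standing in for the real embeddings and labels:

    import numpy as np
    from sklearn import svm
    from sklearn.preprocessing import OneHotEncoder
    
    X = np.random.random((10, 300))                     # placeholder sentence embeddings
    cluster_ids = np.random.randint(1, 4, size=(10, 1))
    y = np.random.randint(0, 2, size=10)                # placeholder labels
    
    one_hot = OneHotEncoder().fit_transform(cluster_ids).toarray()
    X_aug = np.hstack([X, one_hot])                     # (10, 303): 300 + 3 cluster columns
    
    clf = svm.SVC(C=1.0, kernel='linear', gamma='auto')
    clf.fit(X_aug, y)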