I am doing text classification and plan to use word2vec word embeddings and pass them to Conv1D layers. I have a dataframe that contains the texts and their corresponding labels (sentiments). I used the word2vec algorithm from the gensim module to generate the word-embedding model. The code I used:
import pandas as pd
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
df=pd.read_csv('emotion_merged_dataset.csv')
texts=df['text']
labels=df['sentiment']
df_tokenized=df.apply(lambda row: word_tokenize(row['text']), axis=1)
model = Word2Vec(df_tokenized, min_count=1)
I plan to use a CNN with this word-embedding model. But how should I use the word-embedding model in my CNN? What should my input be?
I plan to use something like this (obviously not with the same hyperparameters):
from keras.models import Sequential
from keras import layers
model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1))
Can somebody help me out and point me in the right direction? Thanks in advance.
Sorry for the late response; I hope it is still useful for you. Depending on your application, you may need to download a specific word-embedding file. For example, here is how it works with the GloVe files:
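import numpy as np
from keras.layers import Input, Embedding
EMBEDDING_FILE = 'glove.6B.50d.txt'
embed_size = 50       # how big is each word vector
max_features = 20000  # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100          # max number of words in a comment to use
# read the GloVe file into a dict: word -> vector
embeddings_index = {}
with open(EMBEDDING_FILE, encoding='utf8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
# words missing from GloVe keep random vectors with the same mean/std as the GloVe vectors
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
# tokenizer is assumed to be a Keras Tokenizer that was already fitted on your texts
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index) + 1)  # +1 because index 0 is reserved for padding
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
# this is how you load the weights in the embedding layer
inp = Input(shape=(maxlen,))
x = Embedding(nb_words, embed_size, weights=[embedding_matrix])(inp)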
I took this code from Jeremy Howard. I think this is all you need. If you want to load a different file, the process is pretty similar; usually you just have to change the file you load.
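For the word2vec model from the question, the same idea could look roughly like the sketch below. Treat it as an illustration rather than a definitive implementation: `build_embedding_matrix` is a hypothetical helper name, and the small `toy_vectors` dict and `toy_word_index` are made-up stand-ins so the snippet runs on its own. With gensim you would presumably pass `model.wv` (which supports `word in model.wv` and `model.wv[word]`) as `vectors`, your tokenizer's `word_index`, and the model's vector size as `embed_size`.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, embed_size, max_features, seed=0):
    """Return a matrix whose row i holds the vector of the word with index i.

    word_index: dict mapping word -> integer index, starting at 1 (0 is padding)
    vectors:    any word -> vector lookup that supports `in` and `[]`
                (a plain dict here; assumed to also work with gensim's model.wv)
    """
    # One row per index we keep, including row 0 for padding.
    rows = min(max_features, len(word_index) + 1)

    # Words without a pretrained vector get random values drawn with the
    # same mean/std as the known vectors.
    known = [vectors[w] for w in word_index if w in vectors]
    mean, std = (np.mean(known), np.std(known)) if known else (0.0, 1.0)
    rng = np.random.default_rng(seed)
    matrix = rng.normal(mean, std, (rows, embed_size)).astype("float32")
    matrix[0] = 0.0  # padding row

    # Copy the pretrained vector of every word whose index fits in the matrix.
    for word, i in word_index.items():
        if i < rows and word in vectors:
            matrix[i] = vectors[word]
    return matrix

# Toy stand-ins (assumptions for illustration only, not real embeddings).
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "bad": np.array([0.0, 1.0, 0.0], dtype="float32"),
}
toy_word_index = {"good": 1, "bad": 2, "meh": 3}  # "meh" has no pretrained vector

embedding_matrix = build_embedding_matrix(
    toy_word_index, toy_vectors, embed_size=3, max_features=20000
)
print(embedding_matrix.shape)
```

The resulting matrix can then be handed to the Embedding layer in the same way as in the snippet above, with the layer's first argument equal to the number of rows in the matrix. The input to the network is then the padded sequences of integer word indices (built with the same `word_index`), not the word vectors themselves.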