I've read my train, test and validation sentences into train_sentences, test_sentences and val_sentences.
Then I applied a TF-IDF vectorizer to them.
vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)
X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)
My model looks like this:
model = Sequential()
model.add(Input(????))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Normally, with word2vec, we pass an embedding matrix to the Embedding layer.
How should I use TF-IDF features in a Keras model? Please provide an example.
Thanks.
I cannot imagine a good reason for combining TF/IDF values with embedding vectors, but here is a possible solution: use the functional API, multiple Inputs and the concatenate function.
To concatenate layer outputs, their shapes must be aligned (except for the axis that is being concatenated). One method is to average the embeddings over the sequence axis, which gives one vector per sentence, and then concatenate that vector with the vector of TF/IDF values.
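The shape alignment described above can be sketched with plain NumPy, outside of Keras. The sizes and the random arrays below are illustrative stand-ins (assumptions, not real embeddings or TF/IDF rows); the point is only that averaging over axis 1 removes the sequence dimension so that both inputs become 2-D.

```python
import numpy as np

# Illustrative sizes (assumptions): 4 sentences, 50 tokens each,
# 300-dimensional embeddings and 300 TF/IDF features.
batch_size, maxlen, embedding_size, tfidf_size = 4, 50, 300, 300

rng = np.random.default_rng(0)
# Stand-in for the output of an embedding layer: (batch, maxlen, embedding_size)
embeddings = rng.normal(size=(batch_size, maxlen, embedding_size))
# Stand-in for dense TF/IDF rows: (batch, tfidf_size)
tfidf = rng.random(size=(batch_size, tfidf_size))

# Averaging over the sequence axis gives one vector per sentence.
mean_embedding = embeddings.mean(axis=1)

# Both arrays are now (batch, features), so they can be joined on the last axis.
combined = np.concatenate([tfidf, mean_embedding], axis=1)

print(mean_embedding.shape, combined.shape)
```

Without the averaging step the embedding output would be 3-D and could not be concatenated with the 2-D TF/IDF matrix.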
Setting up, and some sample data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import keras
from keras.models import Model
from keras.layers import Dense, Activation, concatenate, Embedding, Input
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# some sample training data
bunch = fetch_20newsgroups()
all_sentences = []
for document in bunch.data:
    sentences = document.split("\n")
    all_sentences.extend(sentences)
all_sentences = all_sentences[:1000]
X_train, X_test = train_test_split(all_sentences, test_size=0.1)
len(X_train), len(X_test)
vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(X_train)
df_train = vectorizer.transform(X_train)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
maxlen = 50
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_train = pad_sequences(sequences_train, maxlen=maxlen)
Model definition
vocab_size = len(tokenizer.word_index) + 1
embedding_size = 300
input_tfidf = Input(shape=(300,))
input_text = Input(shape=(maxlen,))
embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)
# this averaging method taken from:
# https://stackoverflow.com/a/54217709/1987598
mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)
concatenated = concatenate([input_tfidf, mean_embedding])
dense1 = Dense(256, activation='relu')(concatenated)
dense2 = Dense(32, activation='relu')(dense1)
dense3 = Dense(8, activation='sigmoid')(dense2)
model = Model(inputs=[input_tfidf, input_text], outputs=dense3)
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
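To show how the two inputs might be fed during training, here is a minimal self-contained sketch. It assumes a reasonably recent Keras, uses random arrays in place of the real data, and the 8-column multi-label targets are hypothetical since the question does not show its labels. It also swaps GlobalAveragePooling1D in for the Lambda averaging; with no masking involved, both take the mean over the sequence axis. With real data, the TF/IDF matrix returned by the vectorizer is sparse, so it would likely need converting with .toarray() before being passed to fit.

```python
import numpy as np
from keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense, concatenate
from keras.models import Model

# Hypothetical sizes for the sketch
n_samples, maxlen, tfidf_size, vocab_size = 64, 50, 300, 1000

rng = np.random.default_rng(0)
# Stand-ins for the real inputs: dense TF/IDF rows and padded token ids
tfidf_dense = rng.random((n_samples, tfidf_size)).astype("float32")
sequences = rng.integers(1, vocab_size, size=(n_samples, maxlen))
# Hypothetical targets: 8 independent binary labels per sample
y = rng.integers(0, 2, size=(n_samples, 8)).astype("float32")

input_tfidf = Input(shape=(tfidf_size,))
input_text = Input(shape=(maxlen,))
embedding = Embedding(vocab_size, 300)(input_text)
# Mean over the sequence axis: (batch, maxlen, 300) -> (batch, 300)
mean_embedding = GlobalAveragePooling1D()(embedding)
concatenated = concatenate([input_tfidf, mean_embedding])
hidden = Dense(256, activation="relu")(concatenated)
hidden = Dense(32, activation="relu")(hidden)
output = Dense(8, activation="sigmoid")(hidden)

model = Model(inputs=[input_tfidf, input_text], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The arrays are passed in the same order as the model's inputs.
model.fit([tfidf_dense, sequences], y, epochs=1, batch_size=16, verbose=0)
predictions = model.predict([tfidf_dense, sequences], verbose=0)
print(predictions.shape)
```

The list passed to fit and predict has to follow the order given in inputs=[...], TF/IDF first and token sequences second.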
Model Summary Output
Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_11 (InputLayer)           (None, 50)           0
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 50, 300)      633900      input_11[0][0]
__________________________________________________________________________________________________
input_10 (InputLayer)           (None, 300)          0
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 300)          0           embedding_5[0][0]
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 600)          0           input_10[0][0]
                                                                 lambda_1[0][0]
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 256)          153856      concatenate_4[0][0]
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 32)           8224        dense_5[0][0]
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 8)            264         dense_6[0][0]
==================================================================================================
Total params: 796,244
Trainable params: 796,244
Non-trainable params: 0
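The parameter counts in the summary can be checked by hand with a few lines of arithmetic. The vocabulary size is not stated anywhere above; it is inferred here from the embedding row (633900 parameters at 300 dimensions each), so treat it as an assumption. The helper dense_params is a name made up for this sketch.

```python
embedding_size = 300
# Inferred from the summary (633900 / 300), not stated in the answer
vocab_size = 633900 // embedding_size

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_in * n_out + n_out

embedding_params = vocab_size * embedding_size
dense_1 = dense_params(600, 256)  # 300 TF/IDF features + 300 averaged embedding dims
dense_2 = dense_params(256, 32)
dense_3 = dense_params(32, 8)

total = embedding_params + dense_1 + dense_2 + dense_3
print(vocab_size, dense_1, dense_2, dense_3, total)
```

The Input, Lambda and Concatenate layers contribute no parameters, which is why the total is just the embedding plus the three Dense layers.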