
Use Tf-Idf in a Keras Model


I've read my train, test, and validation sentences into train_sentences, test_sentences, and val_sentences.

Then I applied a Tf-Idf vectorizer to them:

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)

X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)

And my model looks like this:

model = Sequential()

model.add(Input(????))

model.add(Flatten())

model.add(Dense(256, activation='relu'))

model.add(Dense(32, activation='relu'))

model.add(Dense(8, activation='sigmoid'))

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Normally, in the case of word2vec, we pass an embedding matrix to the Embedding layer.

How should I use Tf-Idf in a Keras model? Please provide an example.

Thanks.


Solution

  • I cannot imagine a good reason for combining TF/IDF values with embedding vectors, but here is a possible solution: use the functional API, multiple Input layers, and the concatenate function.

    To concatenate layer outputs, their shapes must be aligned (except along the axis being concatenated). One method is to average the embeddings over the sequence axis and then concatenate the result with the vector of TF/IDF values.

    Setting up, and some sample data

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    
    from sklearn.datasets import fetch_20newsgroups
    
    import numpy as np
    
    import keras
    
    from keras.models import Model
    from keras.layers import Dense, Activation, concatenate, Embedding, Input
    
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    
    # some sample training data
    bunch = fetch_20newsgroups()
    all_sentences = []
    
    for document in bunch.data:
      sentences = document.split("\n")
      all_sentences.extend(sentences)
    
    all_sentences = all_sentences[:1000]
    
    X_train, X_test = train_test_split(all_sentences, test_size=0.1)
    len(X_train), len(X_test)
    
    vectorizer = TfidfVectorizer(max_features=300)
    vectorizer = vectorizer.fit(X_train)
    
    df_train = vectorizer.transform(X_train)
    
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(X_train)
    
    maxlen = 50
    
    sequences_train = tokenizer.texts_to_sequences(X_train)
    sequences_train = pad_sequences(sequences_train, maxlen=maxlen)
    
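    One detail worth flagging: `vectorizer.transform` returns a SciPy sparse matrix, while `pad_sequences` returns a dense NumPy array, and `Model.fit` generally expects dense arrays, so the TF/IDF features may need `.toarray()` first. A small self-contained check of this (with made-up sentences rather than the newsgroup data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import issparse

# made-up sentences standing in for X_train
sentences = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "keras models like dense arrays",
]

vec = TfidfVectorizer(max_features=300).fit(sentences)
X = vec.transform(sentences)

print(issparse(X))      # True: TfidfVectorizer.transform returns a sparse matrix
X_dense = X.toarray()   # densify before passing to model.fit
print(X_dense.shape)    # (n_sentences, n_features)
```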

    Model definition

    vocab_size = len(tokenizer.word_index) + 1
    embedding_size = 300
    
    input_tfidf = Input(shape=(300,))
    input_text = Input(shape=(maxlen,))
    
    embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)
    
    # this averaging method taken from:
    # https://stackoverflow.com/a/54217709/1987598
    
    mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)
    
    concatenated = concatenate([input_tfidf, mean_embedding])
    
    dense1 = Dense(256, activation='relu')(concatenated)
    dense2 = Dense(32, activation='relu')(dense1)
    dense3 = Dense(8, activation='sigmoid')(dense2)
    
    model = Model(inputs=[input_tfidf, input_text], outputs=dense3)
    
    model.summary()
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    
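    The answer above stops before training. To fit a two-input model like this, the input arrays are passed as a list in the same order as `Model(inputs=[...])`. Here is a self-contained miniature of the same architecture on synthetic data (all sizes and the random labels are illustrative, and `GlobalAveragePooling1D` is used as a version-robust equivalent of the `Lambda`/`K.mean` averaging above):

```python
import numpy as np
from keras.models import Model
from keras.layers import Dense, Embedding, GlobalAveragePooling1D, Input, concatenate

# illustrative sizes; real code would take these from df_train / sequences_train
n_samples, maxlen, vocab_size, n_tfidf = 32, 50, 100, 300

# synthetic stand-ins for df_train.toarray() and sequences_train
tfidf_features = np.random.rand(n_samples, n_tfidf).astype("float32")
token_sequences = np.random.randint(1, vocab_size, size=(n_samples, maxlen))
labels = np.random.randint(0, 2, size=(n_samples, 8)).astype("float32")

input_tfidf = Input(shape=(n_tfidf,))
input_text = Input(shape=(maxlen,), dtype="int32")
embedding = Embedding(vocab_size, 64)(input_text)
# GlobalAveragePooling1D averages over the sequence axis,
# the same operation as the Lambda/K.mean trick in the answer
mean_embedding = GlobalAveragePooling1D()(embedding)
concatenated = concatenate([input_tfidf, mean_embedding])
hidden = Dense(32, activation='relu')(concatenated)
output = Dense(8, activation='sigmoid')(hidden)

model = Model(inputs=[input_tfidf, input_text], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')

# the two input arrays go in a list, in the same order as inputs=[...]
model.fit([tfidf_features, token_sequences], labels,
          epochs=1, batch_size=16, verbose=0)
preds = model.predict([tfidf_features, token_sequences], verbose=0)
print(preds.shape)  # (32, 8)
```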

    Model Summary Output

    Model: "model_2"
    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to                     
    ==================================================================================================
    input_11 (InputLayer)           (None, 50)           0                                            
    __________________________________________________________________________________________________
    embedding_5 (Embedding)         (None, 50, 300)      633900      input_11[0][0]                   
    __________________________________________________________________________________________________
    input_10 (InputLayer)           (None, 300)          0                                            
    __________________________________________________________________________________________________
    lambda_1 (Lambda)               (None, 300)          0           embedding_5[0][0]                
    __________________________________________________________________________________________________
    concatenate_4 (Concatenate)     (None, 600)          0           input_10[0][0]                   
                                                                     lambda_1[0][0]                   
    __________________________________________________________________________________________________
    dense_5 (Dense)                 (None, 256)          153856      concatenate_4[0][0]              
    __________________________________________________________________________________________________
    dense_6 (Dense)                 (None, 32)           8224        dense_5[0][0]                    
    __________________________________________________________________________________________________
    dense_7 (Dense)                 (None, 8)            264         dense_6[0][0]                    
    ==================================================================================================
    Total params: 796,244
    Trainable params: 796,244
    Non-trainable params: 0
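
    If the goal is simply to feed TF/IDF features into Keras, with no embeddings at all, then no Embedding or Flatten layer is needed: each TF/IDF vector is already a fixed-length dense feature vector, so `Input(shape=(max_features,))` followed by Dense layers is enough. A minimal sketch of that simpler setup, on synthetic stand-in data:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input

max_features = 300
X = np.random.rand(64, max_features).astype("float32")        # stand-in for vectorizer.transform(...).toarray()
y = np.random.randint(0, 2, size=(64, 8)).astype("float32")   # stand-in multi-label targets

model = Sequential([
    Input(shape=(max_features,)),   # one TF/IDF vector per document
    Dense(256, activation='relu'),
    Dense(32, activation='relu'),
    Dense(8, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, epochs=1, batch_size=16, verbose=0)
preds = model.predict(X, verbose=0)
print(preds.shape)  # (64, 8)
```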