python, machine-learning, nlp, embedding, bert-language-model

Using Sentence-Bert with other features in scikit-learn


I have a dataset with one text feature and 4 other features. A Sentence-BERT model transforms the text data into dense embedding tensors. Can I replace the text column with these embeddings and use them directly with a machine learning classifier? And how would I train the model? The code below is how I transform the text into vectors.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/LaBSE')
sentence_embeddings = model.encode(X_train['tweet'], convert_to_tensor=True, show_progress_bar=True)
sentence_embeddings1 = model.encode(X_test['tweet'], convert_to_tensor=True, show_progress_bar=True)

Solution

  • Let's assume this is your data:

    import pandas as pd

    X_train = pd.DataFrame({
        'tweet':['foo', 'foo', 'bar'],
        'feature1':[1, 1, 0],
        'feature2':[1, 0, 1],
    })
    y_train = [1, 1, 0]
    

    and you want to use it with the sklearn API (cross-validation, pipelines, grid search, and so on). There is a utility named ColumnTransformer which can map a pandas data frame to the desired data using arbitrary user-defined functions. All you have to do is define a function and turn it into an official sklearn transformer with FunctionTransformer:

    from sklearn.preprocessing import FunctionTransformer

    model = SentenceTransformer('mrm8488/bert-tiny-finetuned-squadv2')  # smaller model swapped in for speed and computation gains :)
    embedder = FunctionTransformer(
        lambda item: model.encode(item, convert_to_tensor=True,
                                  show_progress_bar=False).detach().cpu().numpy()
    )
    

    After that, you can use it like any other sklearn transformer to map your text column into the embedding space:

    from sklearn.compose import ColumnTransformer

    preprocessor = ColumnTransformer(
        transformers=[('embedder', embedder, 'tweet')],
        remainder='passthrough',
    )
    # X_train.shape => (len(df), embedding_dim + number_of_other_features)
    X_train = preprocessor.fit_transform(X_train)
    

    X_train is now the data you wanted, ready to use anywhere in the sklearn ecosystem.

    from sklearn.naive_bayes import GaussianNB

    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    

    output: GaussianNB(priors=None, var_smoothing=1e-09)
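    Since the answer mentions cross-validation and grid search, the cleanest way to get those is to chain the preprocessor and the classifier in a Pipeline. Here is a minimal, self-contained sketch; it uses a hypothetical stand-in embedder (two crude hand-made text features) instead of the real SentenceTransformer so it runs without downloading a model — swap the real FunctionTransformer from above back in:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer

    X = pd.DataFrame({
        'tweet': ['foo', 'foo', 'bar', 'baz', 'foo bar', 'bar baz'],
        'feature1': [1, 1, 0, 0, 1, 0],
        'feature2': [1, 0, 1, 0, 1, 1],
    })
    y = [1, 1, 0, 0, 1, 0]

    # Hypothetical stand-in for the SentenceTransformer embedder:
    # maps each tweet to [length, count of 'o'] just to keep the demo offline.
    embedder = FunctionTransformer(
        lambda col: np.array([[len(t), t.count('o')] for t in col], dtype=float)
    )

    preprocessor = ColumnTransformer(
        transformers=[('embedder', embedder, 'tweet')],
        remainder='passthrough',
    )

    # Chaining preprocessing and classifier lets cross_val_score (or
    # GridSearchCV) re-fit the whole thing on each fold.
    pipe = Pipeline([
        ('preprocess', preprocessor),
        ('clf', GaussianNB()),
    ])

    scores = cross_val_score(pipe, X, y, cv=2)
    ```

    The same `pipe` object can be passed straight to GridSearchCV if you want to tune, e.g., `clf__var_smoothing`.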

    Caveat: the numerical features and the tweet embeddings should be on the same SCALE, otherwise one group will dominate the other and degrade performance.
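    One way to honor that caveat is to put a StandardScaler after the ColumnTransformer, so embedding dimensions and numeric features all end up centered with comparable spread. A sketch, again with a hypothetical stand-in embedder in place of the real `model.encode` wrapper:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    X_train = pd.DataFrame({
        'tweet': ['foo', 'foo', 'bar'],
        'feature1': [1, 1, 0],
        'feature2': [1, 0, 1],
    })

    # Hypothetical stand-in embedder; replace with the FunctionTransformer
    # around model.encode shown in the answer.
    embedder = FunctionTransformer(
        lambda col: np.array([[len(t), t.count('o')] for t in col], dtype=float)
    )

    preprocessor = ColumnTransformer(
        transformers=[('embedder', embedder, 'tweet')],
        remainder='passthrough',
    )

    # Scaling AFTER concatenation puts embedding columns and numeric
    # columns on the same scale before they reach the classifier.
    scaled = Pipeline([
        ('preprocess', preprocessor),
        ('scale', StandardScaler()),
    ])

    X_scaled = scaled.fit_transform(X_train)
    # Every column now has zero mean (constant columns are centered at 0).
    ```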