python, machine-learning, nlp, embedding, bert-language-model

Using Sentence-Bert with other features in scikit-learn


I have a dataset with one text feature and 4 other features. A Sentence-BERT model transforms the text data into dense embedding tensors. Can I replace the text column with these embeddings and use them directly with a machine learning classifier? And how would I train the model? The code below is how I transform the text into vectors.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/LaBSE')
sentence_embeddings = model.encode(X_train['tweet'], convert_to_tensor=True, show_progress_bar=True)
sentence_embeddings1 = model.encode(X_test['tweet'], convert_to_tensor=True, show_progress_bar=True)

Solution

  • Let's assume this is your data:

    import pandas as pd

    X_train = pd.DataFrame({
        'tweet':['foo', 'foo', 'bar'],
        'feature1':[1, 1, 0],
        'feature2':[1, 0, 1],
    })
    y_train = [1, 1, 0]
    

    and you want to use it with the sklearn API (cross-validation, pipelines, grid search, and so on). There is a utility named ColumnTransformer which can map a pandas data frame to the desired data using arbitrary user-defined functions. All you have to do is define a function and turn it into an official sklearn transformer with FunctionTransformer:

    from sklearn.preprocessing import FunctionTransformer

    model = SentenceTransformer('mrm8488/bert-tiny-finetuned-squadv2')  # smaller model swapped in for speed and computation gains :)
    embedder = FunctionTransformer(
        lambda item: model.encode(item, convert_to_tensor=True,
                                  show_progress_bar=False).detach().cpu().numpy()
    )
    

    After that, you can use it like any other sklearn transformer to map your text column into the embedding space:

    from sklearn.compose import ColumnTransformer

    preprocessor = ColumnTransformer(
        transformers=[('embedder', embedder, 'tweet')],
        remainder='passthrough',
    )
    # X_train.shape => (len(df), embedding_dim + number_of_other_features)
    X_train = preprocessor.fit_transform(X_train)
    

    X_train is now the data you wanted, ready to use anywhere in the sklearn ecosystem.

    from sklearn.naive_bayes import GaussianNB

    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    

    output: GaussianNB(priors=None, var_smoothing=1e-09)
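    Since the answer mentions cross-validation and grid search, the cleanest way to get those is to chain the preprocessor and the classifier in a Pipeline. Here is a minimal, self-contained sketch; it uses a hypothetical stand-in embedder (two crude hand-made text features) instead of the real SentenceTransformer so it runs without downloading a model — swap the real FunctionTransformer from above back in:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer

    X = pd.DataFrame({
        'tweet': ['foo', 'foo', 'bar', 'baz', 'foo bar', 'bar baz'],
        'feature1': [1, 1, 0, 0, 1, 0],
        'feature2': [1, 0, 1, 0, 1, 1],
    })
    y = [1, 1, 0, 0, 1, 0]

    # Hypothetical stand-in for the SentenceTransformer embedder:
    # maps each tweet to [length, count of 'o'] just to keep the demo offline.
    embedder = FunctionTransformer(
        lambda col: np.array([[len(t), t.count('o')] for t in col], dtype=float)
    )

    preprocessor = ColumnTransformer(
        transformers=[('embedder', embedder, 'tweet')],
        remainder='passthrough',
    )

    # Chaining preprocessing and classifier lets cross_val_score (or
    # GridSearchCV) re-fit the whole thing on each fold.
    pipe = Pipeline([
        ('preprocess', preprocessor),
        ('clf', GaussianNB()),
    ])

    scores = cross_val_score(pipe, X, y, cv=2)
    ```

    The same `pipe` object can be passed straight to GridSearchCV if you want to tune, e.g., `clf__var_smoothing`.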

    Caveat: the numerical features and the tweet embeddings should be on the same SCALE, otherwise one group will dominate the other and degrade performance.
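    One way to honor that caveat is to put a StandardScaler after the ColumnTransformer, so embedding dimensions and numeric features all end up centered with comparable spread. A sketch, again with a hypothetical stand-in embedder in place of the real `model.encode` wrapper:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    X_train = pd.DataFrame({
        'tweet': ['foo', 'foo', 'bar'],
        'feature1': [1, 1, 0],
        'feature2': [1, 0, 1],
    })

    # Hypothetical stand-in embedder; replace with the FunctionTransformer
    # around model.encode shown in the answer.
    embedder = FunctionTransformer(
        lambda col: np.array([[len(t), t.count('o')] for t in col], dtype=float)
    )

    preprocessor = ColumnTransformer(
        transformers=[('embedder', embedder, 'tweet')],
        remainder='passthrough',
    )

    # Scaling AFTER concatenation puts embedding columns and numeric
    # columns on the same scale before they reach the classifier.
    scaled = Pipeline([
        ('preprocess', preprocessor),
        ('scale', StandardScaler()),
    ])

    X_scaled = scaled.fit_transform(X_train)
    # Every column now has zero mean (constant columns are centered at 0).
    ```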