Tags: machine-learning, pytorch, spacy, spacy-transformers

SpaCy-transformers regression output


I would like to get a regression output instead of classification: for instance, instead of n classes, a single floating-point value between 0 and 1.

Here is the minimal example from the package's GitHub page:

    import spacy
    from spacy.util import minibatch
    import random
    import torch

    is_using_gpu = spacy.prefer_gpu()
    if is_using_gpu:
        torch.set_default_tensor_type("torch.cuda.FloatTensor")

    nlp = spacy.load("en_trf_bertbaseuncased_lg")
    print(nlp.pipe_names) # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]
    textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
    for label in ("POSITIVE", "NEGATIVE"):
        textcat.add_label(label)
    nlp.add_pipe(textcat)

    optimizer = nlp.resume_training()
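
    # TRAIN_DATA is assumed to be defined elsewhere as a list of
    # (text, annotations) pairs, e.g. ("...", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}})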
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, cats = zip(*batch)
            nlp.update(texts, cats, sgd=optimizer, losses=losses)
        print(i, losses)
    nlp.to_disk("/bert-textcat")

Is there an easy way to make trf_textcat work as a regressor? Or would it mean extending the library?


Solution

  • I have figured out a workaround: extract vector representations from the nlp pipeline as follows:

        vector_repres = nlp('Test text').vector
    

    After doing this for all the text entries, you end up with a fixed-dimensional representation of each text. Assuming you have the continuous target values, you can use any estimator, including a neural network with a linear output; see the sketch below.

    Note that the vector representation is an average of the vector embeddings of all the words in the text, so it might be a sub-optimal solution for your case.
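
    Putting it together, here is a minimal sketch of the workaround, using scikit-learn's Ridge as one arbitrary choice of estimator; the texts and targets lists are hypothetical placeholders for your own data:

        import numpy as np
        import spacy
        from sklearn.linear_model import Ridge

        nlp = spacy.load("en_trf_bertbaseuncased_lg")

        # Hypothetical placeholders: raw texts and continuous targets in [0, 1]
        texts = ["first example text", "second example text", "third example text"]
        targets = [0.1, 0.5, 0.9]

        # Fixed-dimensional features: one averaged embedding vector per text
        X = np.array([nlp(text).vector for text in texts])
        y = np.array(targets)

        # Any estimator with a continuous output works; Ridge is one simple choice
        regressor = Ridge(alpha=1.0)
        regressor.fit(X, y)

        # Predict a float for a new text
        prediction = regressor.predict(np.array([nlp("Test text").vector]))

    A plain linear model is unbounded, so predictions can fall outside [0, 1]; if the bounds matter, clip the output (e.g. with numpy.clip) or use an estimator with a sigmoid output instead.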