
Train spaCy TextCategorizer on text that belongs to no label


I started experimenting with spaCy's TextCategorizer and was able to train a model on a few hundred examples, each with exactly one (exclusive) label. My idea was to apply this model to text chunks (sentence by sentence, or paragraph by paragraph) and get a label for each chunk. But many chunks should have no label at all, because they do not belong to any category. I had two ideas:

  • Add an additional label "other" and assign it to the training examples that don't belong to any other category.
  • Set the scores of all labels to 0.0 for the examples that don't belong to any category.

Or is there any other approach? Is this something the TextCategorizer can do or are there other models that I can try that might work better?
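To make the two ideas concrete, here is a minimal sketch of what the gold annotations could look like in each case. It only builds plain dictionaries in the {"cats": {label: score}} shape that spaCy uses for text classification annotations; the label names ("sports", "politics") and the helper functions are hypothetical, chosen purely for illustration.

```python
# Hypothetical label set; "other" is the catch-all label from the first idea.
LABELS = ["sports", "politics"]
OTHER = "other"


def cats_with_other(label=None):
    """Idea 1: add a catch-all label so exactly one score is always 1.0."""
    gold = label if label is not None else OTHER
    return {name: 1.0 if name == gold else 0.0 for name in LABELS + [OTHER]}


def cats_all_zero(label=None):
    """Idea 2: no catch-all; chunks without a category get 0.0 everywhere."""
    return {name: 1.0 if name == label else 0.0 for name in LABELS}


# (text, annotations) pairs for idea 1; the texts are made up.
train_data = [
    ("The striker scored twice in the final.", {"cats": cats_with_other("sports")}),
    ("See the appendix for further details.", {"cats": cats_with_other()}),
]
```

With the first helper the scores of every example still sum to 1.0, which is what mutually exclusive labels imply; with the second, a chunk without a category has scores that sum to 0.0.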


Solution

  • It sounds like you should use the SpanCategorizer that will be released soon in spaCy 3.1. Regarding your other approaches...

    Add an additional label "other" and assign it to the training examples that don't belong to any other category.

    This is fine, except that "other" categories tend to be hard to learn.

    Set the scores of all labels to 0.0 for the examples that don't belong to any category.

    I am pretty sure this won't work. textcat isn't designed to be used that way, and even if you don't get an error during training, I don't think the model will learn anything useful.
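If you go with the extra "other" label, applying the model chunk by chunk could look roughly like the sketch below. It assumes each chunk's scores are available as a label-to-score dictionary (in spaCy that would be doc.cats after running the pipeline on the chunk); the label names, the example scores, and the pick_label helper are hypothetical.

```python
def pick_label(cats, other="other"):
    """Return the top-scoring label, or None when the catch-all label wins."""
    best = max(cats, key=cats.get)
    return None if best == other else best


# Made-up scores standing in for doc.cats of two chunks.
chunk_scores = [
    {"sports": 0.7, "politics": 0.1, "other": 0.2},
    {"sports": 0.1, "politics": 0.2, "other": 0.7},
]

# One entry per chunk: a category name, or None for "no label".
labels = [pick_label(cats) for cats in chunk_scores]
```

Here the first chunk is assigned "sports" and the second is left without a label, since "other" has the highest score.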