Tags: python, nlp, dataset, text-classification, spacy

NLP data preparation and sorting for text-classification task


I have read a lot of tutorials on the web and topics on Stack Overflow, but one question is still foggy to me. Considering just the stage of collecting data for multi-label training, which of the approaches below is better, and are both of them acceptable and effective?

  1. Try to find 'pure' single-label examples at any cost.
  2. Allow every example to carry multiple labels.

For instance, I have articles about war, politics, economics, and culture. Usually politics is tied to economics, war is connected to politics, economic issues may appear in culture articles, and so on. I can either assign strictly one main theme to each example and drop the uncertain ones, or assign 2-3 topics per article.

I'm going to train with spaCy; the volume of data will be about 5-10 thousand examples per topic.
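
If I go the multi-label route, I imagine a single training record would look roughly like the sketch below (assuming spaCy v3's `textcat_multilabel` component; the text and label values are made up for illustration):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in ("war", "politics", "economics", "culture"):
    textcat.add_label(label)

# One record: a politics article that also touches economics,
# so two entries in "cats" are set to 1.0.
text = "The government announced new tariffs amid the election campaign."
cats = {"war": 0.0, "politics": 1.0, "economics": 1.0, "culture": 0.0}
example = Example.from_dict(nlp.make_doc(text), {"cats": cats})

# initialize() returns an optimizer; a real run would loop over shuffled batches.
optimizer = nlp.initialize(lambda: [example])
losses = nlp.update([example], sgd=optimizer, losses={})
print(losses)
```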

I'd be grateful for any explanation and/or a link to some relevant discussion.


Solution

  • You can try the OneVsAll / OneVsRest strategy. It lets you have it both ways: you can still predict exactly one category per document, but you don't need to strictly assign a single label when preparing the data.

    Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and one classifier only, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.

    This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.

    Link to docs: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
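
    For illustration, a minimal sketch of this strategy with scikit-learn might look like the following; the sample texts, labels, and the TF-IDF + logistic regression choice are placeholders for the example, not a prescribed setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "Parliament debates the defence budget",         # politics, war
    "Central bank raises interest rates",            # economics
    "Museum funding cut after budget negotiations",  # culture, economics
]
labels = [["politics", "war"], ["economics"], ["culture", "economics"]]

# Binarize labels into the 2-d indicator matrix described above:
# cell [i, j] is 1 if sample i has label j, 0 otherwise.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary classifier is fitted per label under the hood.
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, Y)

pred = clf.predict(["New sanctions reshape the economy during wartime"])
print(mlb.inverse_transform(pred))
```

    If you later decide you want a single label per document after all, you can keep the same per-class classifiers and simply take the class with the highest score instead of thresholding each label independently.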