machine-learning classification spacy text-classification multilabel-classification

Text Classification of News Articles Using spaCy

Dataset: CSV files containing around 1,500 rows with columns (Text, Labels), where Text is a news article in the Nepali language and Labels is its genre (Health, World, Tourism, Weather, and so on).

I am using spaCy to train my text classification model. So far, I have loaded the dataset into a dataframe and then converted it into a format spaCy accepts with this code:

dataset['tuples'] = dataset.apply(
    lambda row: (row['Text'],row['Labels']), axis=1)
training_data = dataset['tuples'].tolist()

which gives me my training dataset as a list of tuples, like [('text...','label...'),('text...','label...')]

Now, how can I do text classification here?

In spaCy's documentation, I found:

textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

Do we have to add labels that match the labels in our own dataset, or should we use POSITIVE/NEGATIVE as well? Does spaCy generate the labels from our dataset during training, or not?

Any suggestions please?

Solution

You have to add your own labels. So, in your case:

textcat.add_label('Health')
textcat.add_label('World')
textcat.add_label('Tourism')
...

spaCy will then be able to predict only the categories that you added in the block of code above.
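Rather than typing every add_label call by hand, the labels can be collected from the training tuples themselves. This is a minimal sketch, assuming training_data is the list of (text, label) tuples built in the question and that each article has exactly one label. The helper names unique_labels and add_labels are hypothetical, and textcat stands for the text categorizer component from the question (how it is created depends on your spaCy version and is not shown here).

```python
def unique_labels(training_data):
    """Return the distinct labels found in (text, label) tuples, sorted."""
    return sorted({label for _text, label in training_data})


def add_labels(textcat, labels):
    """Register each label on the text categorizer (hypothetical helper)."""
    for label in labels:
        textcat.add_label(label)


# Tiny stand-in for the real dataset, for illustration only.
training_data = [
    ("text a", "Health"),
    ("text b", "World"),
    ("text c", "Health"),
]

labels = unique_labels(training_data)
print(labels)  # ['Health', 'World']

# With a real text categorizer component:
# add_labels(textcat, labels)
```

Sorting the labels keeps the order stable between runs, which makes the result easier to compare and debug.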

Training data has a special format: each element of your list is a tuple that contains:

The text
A dictionary with a single key, cats, whose value is another dictionary. That inner dictionary has all of your categories as keys and 1 or 0 as values, indicating whether each category applies to the text.

So, your data should look like this:

[('text1', {'cats' : {'category1' : 1, 'category2' : 0, ...}}), ('text2', {'cats' : {'category1' : 0, 'category2' : 1, ...}}), ...]
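The (text, label) tuples from the question can be turned into this format with a short conversion step. This is a sketch under the assumption that each article carries exactly one label, so one category gets 1 and all others get 0. The function name to_cats_format is hypothetical.

```python
def to_cats_format(training_data):
    """Convert (text, label) tuples into (text, {'cats': {...}}) tuples."""
    # Every category must appear as a key in each 'cats' dictionary.
    labels = sorted({label for _text, label in training_data})
    return [
        (text, {"cats": {name: int(name == label) for name in labels}})
        for text, label in training_data
    ]


# Tiny stand-in for the real dataset, for illustration only.
training_data = [
    ("text a", "Health"),
    ("text b", "World"),
]

converted = to_cats_format(training_data)
print(converted[0])  # ('text a', {'cats': {'Health': 1, 'World': 0}})
```

If an article could belong to several genres at once, the label column would need to hold a collection of labels, and the comparison name == label would become a membership check instead.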