Dataset: CSV files containing around 1,500 rows with columns (Text, Labels), where Text is a news article in Nepali and Labels is its genre (Health, World, Tourism, Weather, and so on).
I am using spaCy to train my text classification model. So far, I have loaded the dataset into a DataFrame and then converted it into a format spaCy accepts with the following code:
dataset['tuples'] = dataset.apply(
    lambda row: (row['Text'], row['Labels']), axis=1)
training_data = dataset['tuples'].tolist()
This gives me a list of tuples for my training dataset, like [('text...','label...'),('text...','label...')].
Now, how can I do text classification here?
In spaCy's documentation, I found:
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
Do we have to add labels that match the labels in our dataset, or should we use POSITIVE/NEGATIVE as well? Does spaCy generate the labels from our dataset after training or not?
Any suggestions please?
You have to add your own labels. So, in your case:
textcat.add_label('Health')
textcat.add_label('World')
textcat.add_label('Tourism')
...
spaCy will then be able to predict only the categories that you added in the block of code above.
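Rather than typing each label by hand, the labels can also be collected from the data itself. This is only a minimal sketch: it assumes training_data is the list of (text, label) tuples from the question and that textcat is the text categorizer component already created as in the spaCy documentation. The helper name register_labels is hypothetical and not part of spaCy.

```python
def register_labels(textcat, training_data):
    """Add every distinct label found in (text, label) tuples to textcat.

    `textcat` is assumed to expose add_label(), as the spaCy text
    categorizer does; `register_labels` itself is a hypothetical helper.
    """
    # Collect the distinct labels and sort them for a stable order.
    labels = sorted({label for _, label in training_data})
    for label in labels:
        textcat.add_label(label)  # same call as in the spaCy docs
    return labels
```

Calling register_labels(textcat, training_data) once before training would replace the individual add_label lines.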
There is a special format for training data: each element of your list is a tuple that contains the text and a dictionary with a single entry, where cats is the key and another dictionary is the value. That inner dictionary contains all your categories as keys and 1 or 0 as values, indicating whether each category is correct or not. So, your data should look like this:
[('text1', {'cats' : {'category1' : 1, 'category2' : 0, ...}}),
('text2', {'cats' : {'category1' : 0, 'category2' : 1, ...}}),
...]
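The list of (text, label) tuples from the question can be turned into this shape with a small conversion step. The following is a sketch under the assumption that each article has exactly one label; the function name to_textcat_format and the sample data are made up for illustration.

```python
def to_textcat_format(pairs):
    """Convert (text, label) tuples into (text, {'cats': {...}}) tuples.

    Assumes exactly one correct label per text; `to_textcat_format`
    is a hypothetical helper, not a spaCy function.
    """
    # Every distinct label in the data becomes a key in each 'cats' dict.
    labels = sorted({label for _, label in pairs})
    return [
        # 1 for the text's own label, 0 for every other category.
        (text, {"cats": {name: int(name == label) for name in labels}})
        for text, label in pairs
    ]


# Illustrative sample data, not taken from the real dataset.
sample = [("text1", "Health"), ("text2", "World")]
converted = to_textcat_format(sample)
```

With the real data, training_data = to_textcat_format(training_data) would give the structure shown above.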