
Text Classification of News Articles Using spaCy


Dataset: CSV files containing around 1,500 rows with columns (Text, Labels), where Text is a news article in the Nepali language and Labels is its genre (Health, World, Tourism, Weather, and so on).

I am using spaCy to train my text classification model. So far, I have loaded the dataset into a dataframe and then converted it into a format spaCy accepts with this code:

dataset['tuples'] = dataset.apply(
    lambda row: (row['Text'],row['Labels']), axis=1)
training_data = dataset['tuples'].tolist()

This gives me a list of tuples for my training dataset, like [('text...','label...'),('text...','label...')]

Now, how can I do text classification here?

In spaCy's documentation, I found

textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

Do we have to add labels that match the labels in our dataset, or should we use POSITIVE/NEGATIVE as well? Does spaCy generate the labels from our dataset during training, or not?

Any suggestions please?


Solution

  • You have to add your own labels. So, in your case:

    textcat.add_label('Health')
    textcat.add_label('World')
    textcat.add_label('Tourism')
    ...
    

    spaCy will then be able to predict only the categories that you added in the block above.

    There is a special format for training data: each element of your data list is a tuple that contains:

    1. Text
    2. A dictionary with a single element: the key is cats and the value is another dictionary. That inner dictionary has all of your categories as keys and 1 or 0 as values, indicating whether each category is correct for the text or not.

    So, your data should look like this:

    [('text1', {'cats' : {'category1' : 1, 'category2' : 0, ...}}), ('text2', {'cats' : {'category1' : 0, 'category2' : 1, ...}}), ...]
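
As a rough sketch of how the (text, label) tuples from the question could be turned into this format, something like the following might work. It assumes each article has exactly one genre; the helper name to_textcat_format and the sample rows are made up for illustration and are not part of spaCy.

```python
def to_textcat_format(pairs):
    """Convert (text, label) pairs into (text, {'cats': {...}}) tuples."""
    # Collect every distinct label, sorted so the order is stable.
    labels = sorted({label for _, label in pairs})
    examples = []
    for text, label in pairs:
        # 1 for the article's own genre, 0 for every other genre.
        cats = {name: int(name == label) for name in labels}
        examples.append((text, {"cats": cats}))
    return labels, examples


# Hypothetical sample rows standing in for dataset['tuples'].tolist()
training_data = [("text1", "Health"), ("text2", "World"), ("text3", "Health")]
labels, train_examples = to_textcat_format(training_data)
```

The returned labels list holds each genre once, so you could loop over it and call textcat.add_label(label) for every entry instead of typing the labels by hand.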