Search code examples
nlpnltktext-classificationnaivebayesmulticlass-classification

Multiclass text classification with python and nltk


I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics

About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)

I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).

Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.

Heres the actual code.

This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.

Thank you.


Solution

  • In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.

    In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.

    You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.