machine-learning scikit-learn classification corpus multilabel-classification

Multi-label classification involving range of numbers as labels


I have a classification problem where my labels are ratings from 0 to 100, in increments of 1 (e.g. 1, 2, 3, 4, ...).

I have a data set where each row has a name, text corpus, and a rating (0 - 100).

From the text corpus I am trying to extract features that I can feed into my classifier, which will output a corresponding rating per row (0 - 100).

For feature selection, I am thinking of starting with a basic bag of words. My question lies in the classification algorithm, however. Is there a classification algorithm in scikit-learn that supports this kind of problem?

I was reading http://scikit-learn.org/stable/modules/multiclass.html, but the algorithms described seem to support labels that are completely discrete, whereas I have a set of continuous labels.

EDIT: What about the case where I bin my ratings? For example, I could have 10 labels, 1 - 10.


Solution

  • You can use multivariate regression instead of classification. You can cluster the n-gram features from the text corpus to form a dictionary and use it to build a feature set. With this feature set, train a regression model whose output is a continuous value. You can then round the real-valued output to get a discrete label in the 0-100 range.
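
A minimal sketch of the regress-then-round idea, under some assumptions that are not in the answer: the toy texts and ratings are invented, Ridge is just one possible regressor, and CountVectorizer's n-gram vocabulary is used directly as the dictionary rather than clustering the n-grams first. The helper name predict_rating is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy data, invented purely for illustration.
texts = [
    "terrible product, broke after one day",
    "not great, would not buy again",
    "average quality, does the job",
    "good value and works well",
    "excellent, absolutely love it",
]
ratings = np.array([5, 25, 50, 75, 95])

# Bag of words over unigrams and bigrams, fed into a linear regressor.
# Ridge is an assumption here; any scikit-learn regressor could be swapped in.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
model.fit(texts, ratings)


def predict_rating(docs):
    """Predict a continuous score, then round and clip it to an integer in 0-100."""
    raw = model.predict(docs)
    return np.clip(np.rint(raw), 0, 100).astype(int)


predicted = predict_rating(["good product, works well"])
print(predicted)
```

Clipping after rounding matters because a regressor is free to predict values below 0 or above 100. Treating the ratings as a regression target also keeps their ordering, which a plain 101-class classifier would ignore.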