machine-learning · scikit-learn · document-classification

Scikit-learn Multiclass Naive Bayes with probabilities for y


I'm doing tweet classification, where each tweet can belong to one of a few classes. The training-set output is given as the probability of each sample belonging to each class. E.g.: tweet #1: C1-0.6, C2-0.4, C3-0.0 (C1, C2, C3 being the classes)

I'm planning to use a Naive Bayes classifier from Scikit-learn. I couldn't find a fit method in naive_bayes.py that takes a probability for each class during training. I need a classifier that accepts per-class output probabilities for the training set (i.e. y.shape = [n_samples, n_classes]).

How can I process my data set to apply a Naive Bayes classifier?


Solution

  • This is not so easy, as "class probabilities" can have many interpretations.

    In the case of an NB classifier and sklearn, the easiest procedure I see is:

    1. Split (duplicate) your training samples according to the following rule: given a sample (x, [p1, p2, ..., pk]) (where pi is the probability of the ith class), create artificial training samples (x, 1, p1), (x, 2, p2), ..., (x, k, pk). You get k new observations, each "attached" to one class, with pi treated as a sample weight, which NB (in sklearn) accepts.
    2. Train your NB with fit(X, Y, sample_weight=sample_weights) (where X is a matrix of your x observations, Y is a vector of the classes from the previous step, and sample_weights is a vector of the pi values from the previous step).

    For example if your training set consists of two points:

    • ( [0 1], [0.6 0.4] )
    • ( [1 3], [0.1 0.9] )

    You transform them to:

    • ( [0 1], 1, 0.6)
    • ( [0 1], 2, 0.4)
    • ( [1 3], 1, 0.1)
    • ( [1 3], 2, 0.9)

    and train NB with

    • X = [ [0 1], [0 1], [1 3], [1 3] ]
    • Y = [ 1, 2, 1, 2 ]
    • sample_weights = [ 0.6, 0.4, 0.1, 0.9 ]
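
The transformation above can be sketched in code. This is a minimal illustration using the two example points; the choice of `MultinomialNB` is an assumption (for tweet data you'd typically use it over word-count features, but any sklearn NB variant that accepts `sample_weight` works the same way):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Example data from above: two samples, soft labels over two classes.
X = np.array([[0, 1],
              [1, 3]])
P = np.array([[0.6, 0.4],
              [0.1, 0.9]])  # P[i, c] = probability that sample i belongs to class c

n_samples, n_classes = P.shape

# Duplicate each sample once per class; the class probability
# becomes that duplicate's sample weight.
X_dup = np.repeat(X, n_classes, axis=0)            # [[0,1],[0,1],[1,3],[1,3]]
y_dup = np.tile(np.arange(1, n_classes + 1), n_samples)  # [1, 2, 1, 2]
sample_weights = P.ravel()                          # [0.6, 0.4, 0.1, 0.9]

clf = MultinomialNB()
clf.fit(X_dup, y_dup, sample_weight=sample_weights)
```

After fitting, `clf.predict_proba` returns per-class probabilities for new tweets, which matches the shape of the soft labels you started with.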