I'm doing tweet classification, where each tweet can belong to one of a few classes. The training set output is given as the probability of each sample belonging to each class. E.g.: tweet#1: C1-0.6, C2-0.4, C3-0.0 (C1, C2, C3 being classes)
I'm planning to use a Naive Bayes classifier from scikit-learn. I couldn't find a fit method in naive_bayes.py that takes a probability for each class for training. I need a classifier that accepts output probabilities for each class for the training set (i.e., y.shape = [n_samples, n_classes]).
How can I process my data set to apply a NaiveBayes classifier?
This is not so easy, as the "classes probability" can have many interpretations.
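In the case of an NB classifier and sklearn, the easiest procedure I see is:
fit(X, Y, sample_weights)
(where X is a matrix of your x observations with each one repeated once per class, Y is the vector of class labels attached to those copies, and sample_weights is the vector of the corresponding probabilities pi, which sklearn's NB fit accepts as its sample_weight argument). For example, if your training set consists of two points:
( [0 1], [0.6 0.4] )
( [1 3], [0.1 0.9] )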
You transform them to:
( [0 1], 1, 0.6 )
( [0 1], 2, 0.4 )
( [1 3], 1, 0.1 )
( [1 3], 2, 0.9 )
and train NB with
X = [[0, 1], [0, 1], [1, 3], [1, 3]]
Y = [1, 2, 1, 2]
sample_weights = [0.6, 0.4, 0.1, 0.9]
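The transformation above can be sketched in a few lines. This is only a minimal illustration, not a definitive implementation: the helper name expand_soft_labels is hypothetical, and MultinomialNB is an assumption chosen because the toy features are non-negative counts (other sklearn NB variants also take sample_weight in fit). Pairs with zero probability are dropped so that no class is fitted on weightless rows.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB


def expand_soft_labels(X, P, classes):
    """Hypothetical helper: repeat each row of X once per class and
    use the class probability P[i, j] as that copy's sample weight."""
    X = np.asarray(X)
    P = np.asarray(P, dtype=float)
    n_samples, n_classes = P.shape
    X_rep = np.repeat(X, n_classes, axis=0)          # x1, x1, ..., x2, x2, ...
    y_rep = np.tile(np.asarray(classes), n_samples)  # c1, c2, ..., c1, c2, ...
    w_rep = P.ravel()                                # row-major, same order as above
    keep = w_rep > 0                                 # drop zero-probability pairs
    return X_rep[keep], y_rep[keep], w_rep[keep]


# The two training points from the example, with their class probabilities.
X = [[0, 1], [1, 3]]
P = [[0.6, 0.4], [0.1, 0.9]]
X_rep, y_rep, w_rep = expand_soft_labels(X, P, classes=[1, 2])

clf = MultinomialNB()  # assumption: count-like, non-negative features
clf.fit(X_rep, y_rep, sample_weight=w_rep)
proba = clf.predict_proba([[1, 3]])  # one row, one column per class
```

Here X_rep, y_rep and w_rep come out as the X, Y and sample_weights arrays listed above, and predict_proba returns a probability per class for a new observation.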