Tags: r, machine-learning, classification, text-mining

Naive Bayes classifier bases decision only on a-priori probabilities


I'm trying to classify tweets according to their sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071.

I have two data frames: one training set and one set of new tweets whose sentiment needs to be predicted.

trainingset data frame:

    +--------------------------------------+-----------+
    | text                                 | sentiment |
    +--------------------------------------+-----------+
    | this stock is a good buy             | Buy       |
    | markets crash in tokyo               | Sell      |
    | everybody excited about new products | Hold      |
    +--------------------------------------+-----------+

Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4].

classifier <- naiveBayes(trainingset[,2], as.factor(trainingset[,4]), laplace = 1)

Looking into the elements of classifier with

classifier$tables$x

I find that the conditional probabilities have been calculated. There are different probabilities for every tweet for Buy, Hold, and Sell. So far, so good.

However when I predict the training set with:

predict(classifier, trainingset[,2], type="raw")

I get a classification which is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet has the same probabilities for Buy, Hold, and Sell:

    +----+------+------+------+
    | Id | Buy  | Hold | Sell |
    +----+------+------+------+
    | 1  | 0.25 | 0.5  | 0.25 |
    | 2  | 0.25 | 0.5  | 0.25 |
    | 3  | 0.25 | 0.5  | 0.25 |
    | .. | .... | .... | .... |
    | N  | 0.25 | 0.5  | 0.25 |
    +----+------+------+------+

Any ideas what I'm doing wrong? Appreciate your help!

Thanks


Solution

  • It looks like you trained the model using whole sentences as inputs, while you want to use individual words as your input features.

    Usage:

    ## S3 method for class 'formula'
    naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
    ## Default S3 method:
    naiveBayes(x, y, laplace = 0, ...)
    
    
    ## S3 method for class 'naiveBayes'
    predict(object, newdata,
      type = c("class", "raw"), threshold = 0.001, ...)
    

    Arguments:

      x: A numeric matrix, or a data frame of categorical and/or
         numeric variables.
    
      y: Class vector.
    

    In particular, if you train naiveBayes this way:

    x <- c("john likes cake", "marry likes cats and john")
    y <- as.factor(c("good", "bad"))
    bayes <- naiveBayes(x, y)
    

    you get a classifier able to recognize just these two sentences:

    Naive Bayes Classifier for Discrete Predictors
    
    Call:
    naiveBayes.default(x = x,y = y)
    
    A-priori probabilities:
    y
     bad good 
     0.5  0.5 
    
    Conditional probabilities:
          x
    y      john likes cake marry likes cats and john
      bad                0                         1
      good               1                         0
    
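    As a hedged illustration of why this matters (the sentence below is hypothetical, not from the question): the whole sentence is a single categorical value, so text that was not in the training data carries no usable evidence, and the prediction falls back toward the a-priori probabilities, which is exactly the symptom described in the question.

    ```r
    # Hypothetical check: ask the sentence-level model about an unseen sentence.
    # The entire sentence is one categorical feature value, so unseen text
    # contributes no evidence and the posterior collapses toward the priors.
    predict(bayes, "cake is good", type = "raw")
    ```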

    To obtain a word-level classifier, you need to train it with individual words as inputs (note it is `as.factor`, not `as.factors`):

    x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
    y <- as.factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))
    bayes <- naiveBayes(x, y)
    
    

    you get

    Naive Bayes Classifier for Discrete Predictors
    
    Call:
    naiveBayes.default(x = x,y = y)
    
    A-priori probabilities:
    y
     bad good 
     0.625 0.375 
    
    Conditional probabilities:
          x
    y            and      cake      cats      john     likes     marry
      bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
      good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
    
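    To see how these word-level tables combine, the naive Bayes posterior for a new tweet can be sketched by hand (the scored sentence is an illustrative choice; note that without `laplace` smoothing, a single zero-probability word zeroes out its class entirely):

    ```r
    # Sketch: score "john likes cake" by hand from the fitted word-level model.
    # Naive Bayes: P(class | words) is proportional to P(class) * prod_w P(w | class).
    library(e1071)

    x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
    y <- as.factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))
    bayes <- naiveBayes(x, y)

    words  <- unlist(strsplit("john likes cake", " "))
    prior  <- bayes$apriori / sum(bayes$apriori)   # class priors from the counts
    cond   <- bayes$tables$x                       # P(word | class), one row per class
    scores <- prior * apply(cond[, words, drop = FALSE], 1, prod)
    scores / sum(scores)                           # normalized posterior over classes
    ```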

    In general, R is not well suited for processing NLP data; Python (or at least Java) would be a much better choice.

    To split a sentence into words, you can use the strsplit function:

    unlist(strsplit("john likes cake"," "))
    [1] "john"  "likes" "cake"