Tags: r, machine-learning, classification, text-mining

Naive Bayes classifier bases decision only on a-priori probabilities


I'm trying to classify tweets according to their sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071.

I have two data frames: one training set and one set of new tweets whose sentiment needs to be predicted.

trainingset data frame:

    +--------------------------------------+-----------+
    | text                                 | sentiment |
    +--------------------------------------+-----------+
    | this stock is a good buy             | Buy       |
    | markets crash in tokyo               | Sell      |
    | everybody excited about new products | Hold      |
    +--------------------------------------+-----------+

Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4].

classifier <- naiveBayes(trainingset[,2], as.factor(trainingset[,4]), laplace = 1)

Looking into the elements of classifier with

classifier$tables$x

I find that the conditional probabilities have been calculated. There are different probabilities for every tweet for Buy, Hold, and Sell. So far, so good.

However when I predict the training set with:

predict(classifier, trainingset[,2], type="raw")

I get a classification which is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet has the same probabilities for Buy, Hold, and Sell:

    +----+------+------+------+
    | Id | Buy  | Hold | Sell |
    +----+------+------+------+
    | 1  | 0.25 | 0.5  | 0.25 |
    | 2  | 0.25 | 0.5  | 0.25 |
    | 3  | 0.25 | 0.5  | 0.25 |
    | .. | .... | .... | .... |
    | N  | 0.25 | 0.5  | 0.25 |
    +----+------+------+------+

Any ideas what I'm doing wrong? Appreciate your help!

Thanks


Solution

  • It looks like you trained the model using whole sentences as inputs, while you want to use individual words as your input features.

    Usage:

    ## S3 method for class 'formula'
    naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
    ## Default S3 method:
    naiveBayes(x, y, laplace = 0, ...)
    
    
    ## S3 method for class 'naiveBayes'
    predict(object, newdata,
      type = c("class", "raw"), threshold = 0.001, ...)
    

    Arguments:

      x: A numeric matrix, or a data frame of categorical and/or
         numeric variables.
    
      y: Class vector.
    

    In particular, if you train naiveBayes this way:

    x <- c("john likes cake", "marry likes cats and john")
    y <- as.factor(c("good", "bad"))
    bayes <- naiveBayes(x, y)
    

    you get a classifier able to recognize just these two sentences:

    Naive Bayes Classifier for Discrete Predictors
    
    Call:
    naiveBayes.default(x = x,y = y)
    
    A-priori probabilities:
    y
     bad good 
     0.5  0.5 
    
    Conditional probabilities:
          x
    y      john likes cake marry likes cats and john
      bad                0                         1
      good               1                         0
    
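    As a hedged illustration of why this matters (the sentence below is hypothetical, not from the question): the whole sentence is a single categorical value, so text that was not in the training data carries no usable evidence, and the prediction falls back toward the a-priori probabilities, which is exactly the symptom described in the question.

    ```r
    # Hypothetical check: ask the sentence-level model about an unseen sentence.
    # The entire sentence is one categorical feature value, so unseen text
    # contributes no evidence and the posterior collapses toward the priors.
    predict(bayes, "cake is good", type = "raw")
    ```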

    To obtain a word-level classifier, you need to train it with individual words as inputs (note it is `as.factor`, not `as.factors`):

    x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
    y <- as.factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))
    bayes <- naiveBayes(x, y)
    
    

    you get

    Naive Bayes Classifier for Discrete Predictors
    
    Call:
    naiveBayes.default(x = x,y = y)
    
    A-priori probabilities:
    y
     bad good 
     0.625 0.375 
    
    Conditional probabilities:
          x
    y            and      cake      cats      john     likes     marry
      bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
      good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
    
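    To see how these word-level tables combine, the naive Bayes posterior for a new tweet can be sketched by hand (the scored sentence is an illustrative choice; note that without `laplace` smoothing, a single zero-probability word zeroes out its class entirely):

    ```r
    # Sketch: score "john likes cake" by hand from the fitted word-level model.
    # Naive Bayes: P(class | words) is proportional to P(class) * prod_w P(w | class).
    library(e1071)

    x <- c("john", "likes", "cake", "marry", "likes", "cats", "and", "john")
    y <- as.factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))
    bayes <- naiveBayes(x, y)

    words  <- unlist(strsplit("john likes cake", " "))
    prior  <- bayes$apriori / sum(bayes$apriori)   # class priors from the counts
    cond   <- bayes$tables$x                       # P(word | class), one row per class
    scores <- prior * apply(cond[, words, drop = FALSE], 1, prod)
    scores / sum(scores)                           # normalized posterior over classes
    ```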

    In general, R is not well suited for processing NLP data; Python (or at least Java) would be a much better choice.

    To split a sentence into words, you can use the strsplit function:

    unlist(strsplit("john likes cake"," "))
    [1] "john"  "likes" "cake"