r, glm, sentiment-analysis

Sentiment prediction using glm


I am trying to predict sentiment using glm and ran into the following problem:

  train_data_df <- as.data.frame(as.matrix(train_data))
  log_model <- glm(sentiment ~ word_count, data = train_data_df, family = binomial)
  Error in sort.list(y) : 'x' must be atomic for 'sort.list'
  Have you called 'sort' on a list?

The structure of the inputs "sentiment" and "word_count" is as follows:

> str(train_data$sentiment[1:2])
List of 2
 $ : num 1
 $ : num 1
> str(train_data$word_count[1:2])
List of 2
 $ :List of 1
  ..$ :Classes 'term_frequency', 'integer'  Named int [1:24] 3 1 1 1 1 1 1 1 1 3 ...
  .. .. ..- attr(*, "names")= chr [1:24] "and" "bags" "came" "disappointed" ...
 $ :List of 1
  ..$ :Classes 'term_frequency', 'integer'  Named int [1:22] 2 1 1 1 1 1 1 1 1 1 ...
  .. .. ..- attr(*, "names")= chr [1:22] "and" "anyone" "bed" "comfortable" ...



head(train_data_df[1,])
                   name
2 Planetwise Wipe Pouch
                                                                                                                                                          review
2 it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.
  rating
2      5
                                                                                                                                                review_clean
2 it came early and was not disappointed i love planet wise bags and now my wipe holder it keps my osocozy wipes moist and does not leak highly recommend it
                                                              word_count sentiment
2 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1         1

Thanks in advance for helping me out


Solution

  • In an R formula such as sentiment ~ word_count, each variable is expected to be an atomic vector, i.e. to hold a single number or factor level per row (this is what 'x' must be atomic means). Your word_count column does not satisfy this: for each row it holds a list of several integer counts (Have you called 'sort' on a list? - well, indeed you have).

    To confirm that this is the source of the problem, replace word_count with the sum of its elements; this should make the code work. Whether the result is of any real value for sentiment prediction is another story, but that is not your actual question here.
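The check suggested above could look roughly like the following. This is only a minimal sketch on made-up toy data: the column names total_words and sentiment_num and all the values are assumptions for illustration, not part of the original data. The str() output in the question shows sentiment stored as a list as well, so the sketch also flattens that column with unlist() before fitting.

```r
# Toy stand-in for the data in the question; all values are made up.
# I() keeps each list as a single list column of the data frame.
train_data_df <- data.frame(id = 1:6)
train_data_df$word_count <- I(list(
  c(and = 3L, bags = 1L, came = 1L),   # 5 words in total
  c(and = 2L, bed = 1L),               # 3
  c(bad = 1L),                         # 1
  c(great = 4L, love = 2L, it = 1L),   # 7
  c(not = 1L, good = 1L),              # 2
  c(love = 5L, this = 2L, so = 2L)     # 9
))
train_data_df$sentiment <- I(list(1, 0, 1, 1, 0, 0))

# Both columns are lists rather than atomic vectors.
sapply(train_data_df[c("word_count", "sentiment")], is.list)

# Collapse each row's counts to one number; unlist() also flattens
# the extra level of nesting seen in the question's str() output.
train_data_df$total_words <- vapply(
  train_data_df$word_count,
  function(x) sum(unlist(x)),
  numeric(1)
)

# Turn the list of single numbers into a plain numeric vector.
train_data_df$sentiment_num <- unlist(train_data_df$sentiment)

# With atomic columns on both sides, glm() can build its model frame.
log_model <- glm(sentiment_num ~ total_words, data = train_data_df,
                 family = binomial)
coef(log_model)
```

If the model fits once both columns are atomic, the list columns were indeed the cause of the error. A total word count is only a diagnostic stand-in here, not a recommended feature for sentiment prediction.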