I am trying to predict sentiment using glm and ran into the following problem:
train_data_df <- as.data.frame(as.matrix(train_data))
log_model <- glm(sentiment ~ word_count, data = train_data_df, family = binomial)
> Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
The structure of the inputs "sentiment" and "word_count" is as follows:
> str(train_data$sentiment[1:2])
List of 2
$ : num 1
$ : num 1
> str(train_data$word_count[1:2])
List of 2
$ :List of 1
.. $ :Classes 'term_frequency', 'integer' Named int [1:24] 3 1 1 1 1 1 1 1 1 3 ...
.. .. ..- attr(*, "names")= chr [1:24] "and" "bags" "came" "disappointed" ...
$ :List of 1
.. $ :Classes 'term_frequency', 'integer' Named int [1:22] 2 1 1 1 1 1 1 1 1 1 ...
.. .. ..- attr(*, "names")= chr [1:22] "and" "anyone" "bed" "comfortable" ...
head(train_data_df[1,])
name
2 Planetwise Wipe Pouch
review
2 it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.
rating
2 5
review_clean
2 it came early and was not disappointed i love planet wise bags and now my wipe holder it keps my osocozy wipes moist and does not leak highly recommend it
word_count sentiment
2 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1 1
Thanks in advance for helping me out
In an R formula like the one you use, sentiment ~ word_count, each variable is expected to be an atomic vector, i.e. to hold a single number or factor level per row (this is what "'x' must be atomic" means). This is obviously not the case with your word_count column: it appears that, for each row, word_count is a list consisting of several integer values ("Have you called 'sort' on a list?" Well, indeed you have).
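A quick way to see which columns are lists rather than atomic vectors is sketched below. The data frame demo_df and its values are made up for illustration; they only mimic the list-column shape shown in the str() output above.

```r
# Hypothetical toy data: 'sentiment' is a list column, 'rating' is atomic.
demo_df <- data.frame(id = 1:3)
demo_df$sentiment <- list(1, 0, 1)
demo_df$rating <- c(5, 1, 4)

# Flag every column that is a list instead of an atomic vector.
is_list_col <- vapply(demo_df, is.list, logical(1))
names(demo_df)[is_list_col]
# "sentiment"
```

Any column flagged this way would need to be flattened to one value per row before it can be used in the formula.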
To confirm that this is the source of your issue, you can replace word_count with the sum of its elements; this should make the code work. Whether the result will be of any real value for sentiment prediction is another story, but that is not your actual question here.
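A minimal sketch of that check, assuming each word_count entry is a one-element list wrapping a named integer vector, as in the str() output. The data frame toy, its values, and the column names sentiment_num and n_words are invented for illustration. Since str() shows sentiment stored as a list too, the sketch also flattens it with unlist().

```r
# Hypothetical stand-in for train_data_df with the same list-column layout.
toy <- data.frame(id = 1:6)
toy$sentiment <- list(1, 1, 0, 1, 0, 0)
toy$word_count <- list(
  list(c(and = 3L, bags = 1L)),
  list(c(and = 2L, bed = 1L, comfortable = 3L)),
  list(c(not = 2L, leak = 3L)),
  list(c(love = 4L, it = 3L)),
  list(c(bad = 2L, broke = 1L)),
  list(c(and = 1L, returned = 5L))
)

# Flatten both list columns into atomic vectors, one value per row.
toy$sentiment_num <- unlist(toy$sentiment)
toy$n_words <- sapply(toy$word_count, function(wc) sum(unlist(wc)))

# With atomic columns on both sides of the formula, glm() can build the model frame.
fit <- glm(sentiment_num ~ n_words, data = toy, family = binomial)
coef(fit)
```

If this runs on your real data, the list columns were indeed the cause of the error.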