I'm struggling with NA values in a term-document matrix (TDM) that prevent me from running the clustering. Initially I set:
titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf)))))
titles.sc <- scale(na.omit(titles.tdm))
and got a matrix of 418 terms and 6955 documents. At this point, executing:
titles.km <- kmeans(titles.sc, 2)
throws
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
When I tried to remove those values with:
titles.sf <- titles.sc[,colSums(titles.sc) > 0]
I got a matrix of 4695 documents, but applying the kmeans function still throws this error. When I inspected the titles.sf variable, there were still columns (docs) with NA values. I'm confused and don't know what I'm doing wrong. How do I remove those documents?
Earlier, I applied titles.cw <- titles.cc[which(str_trim(titles.cc$content) != "")], where titles.cc is a plain Corpus object from the tm library, to delete blank documents. It probably worked, but my NA values are definitely in documents that are not blank.
Here's some example data:
set.seed(123)
titles.sc <- matrix(1:25,5,5)
titles.sc[sample(length(titles.sc),5)]<-NA
titles.sc
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    6   11   16   21
# [2,]    2    7   12   17   NA
# [3,]    3   NA   13   18   23
# [4,]    4    9   14   NA   24
# [5,]    5   NA   15   NA   25
kmeans throws your error
kmeans(titles.sc, 2)
# Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
because your column subsetting is probably not what you'd expect:
colSums(titles.sc) > 0
# [1] TRUE NA TRUE NA NA
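colSums produces NA if missing values are not removed (see the help page ?colSums). An NA in a logical column index is not dropped; it yields an all-NA column, which is why those documents survive your subsetting. Among other things, you could do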
colSums(is.na(titles.sc)) == 0
# [1] TRUE FALSE TRUE FALSE FALSE
or
!is.na(colSums(titles.sc) > 0)
# [1] TRUE FALSE TRUE FALSE FALSE
And now, it works:
titles.sf <- titles.sc[, colSums(is.na(titles.sc)) == 0, drop = FALSE]
kmeans(titles.sf,2)
# K-means clustering with 2 clusters of sizes 2, 3
# ...
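As for where the NAs in your real data might come from: this is a guess, since the corpus isn't shown, but scale() standardizes each column by its standard deviation, so a document whose column is constant (for example all zeros, because none of its terms survived the global bounds) turns into NaN. That would also explain why na.omit() before scale() has no effect, as the NaNs only appear afterwards. Below is a minimal sketch of that assumption; the toy matrix and the names tdm, keep and sc_ok are made up for illustration.

```r
# Hypothetical toy term-document matrix: rows are terms, columns are documents.
# Document d2 contains none of the kept terms, so its column is all zeros.
tdm <- matrix(c(1, 0, 2, 0,
                0, 0, 0, 0,
                3, 1, 0, 2,
                0, 2, 1, 1),
              nrow = 4,
              dimnames = list(paste0("t", 1:4), paste0("d", 1:4)))

# scale() centers each column and divides by its standard deviation.
# A constant column has sd 0, so every entry becomes 0/0 = NaN.
sc <- scale(tdm)
all(is.nan(sc[, "d2"]))

# Drop constant documents before scaling, so no NaN is created at all.
keep  <- apply(tdm, 2, sd) > 0
sc_ok <- scale(tdm[, keep, drop = FALSE])
anyNA(sc_ok)

# kmeans now runs; as in the question, it clusters the rows (terms).
set.seed(1)
km <- kmeans(sc_ok, 2)
```

If this is indeed the cause, filtering before scaling avoids producing the NaN columns, whereas the colSums(is.na(...)) == 0 filter above removes them after the fact; both should leave kmeans with a clean matrix.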