How to remove NA columns from a TDM for clustering

I'm struggling with NA values in a term-document matrix (TDM) that prevent me from running a clustering. Initially I set:

titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf)))))

titles.sc <- scale(na.omit(titles.tdm))

and got a matrix of 418 terms and 6955 documents. At this point, executing titles.km <- kmeans(titles.sc, 2) throws: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I then tried to remove those values with:

titles.sf <- titles.sc[,colSums(titles.sc) > 0]

That left a matrix of 4695 documents, but kmeans still throws the same error. When I inspect titles.sf, there are still columns (documents) with NA values. I'm confused and don't know what I'm doing wrong. How can I remove those documents?

Earlier, I applied titles.cw <- titles.cc[which(str_trim(titles.cc$content) != "")], where titles.cc is a plain Corpus object from the tm library, to delete blank documents. It probably worked, but my NA values are definitely in documents that are not blank.

Solution

Here's some example data:

set.seed(123)
titles.sc <- matrix(1:25,5,5)
titles.sc[sample(length(titles.sc), 5)] <- NA
titles.sc
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    6   11   16   21
# [2,]    2    7   12   17   NA
# [3,]    3   NA   13   18   23
# [4,]    4    9   14   NA   24
# [5,]    5   NA   15   NA   25

kmeans throws your error

kmeans(titles.sc, 2)
# Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

because your column subsetting probably does not do what you expect:

colSums(titles.sc) > 0
# [1] TRUE   NA TRUE   NA   NA

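colSums returns NA for every column that contains a missing value unless na.rm = TRUE is set (see ?colSums). A logical index that contains NA does not drop those columns; it returns an all-NA column for each NA entry, which is why NA columns survive your subsetting. Among other things, you could use either of these tests instead: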

colSums(is.na(titles.sc)) == 0
# [1]  TRUE FALSE  TRUE FALSE FALSE

!is.na(colSums(titles.sc) > 0)
# [1]  TRUE FALSE  TRUE FALSE FALSE
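A related pitfall, shown here as a minimal sketch on a hypothetical hand-built matrix m (not the data from the question): adding na.rm = TRUE to the original colSums(...) > 0 test does not help either, because the NA is simply skipped while summing and the column still passes.

```r
# Hypothetical 3x3 matrix with a single NA in column 2
# (filled column-wise: columns are (1,2,3), (4,NA,6), (7,8,9)).
m <- matrix(c(1, 2, 3,
              4, NA, 6,
              7, 8, 9), nrow = 3)

# na.rm = TRUE ignores the NA while summing, so column 2 still passes the test
keep_sum <- colSums(m, na.rm = TRUE) > 0

# Counting the NAs per column flags exactly the columns that contain one
keep_na <- colSums(is.na(m)) == 0

keep_sum
keep_na
```

Here keep_sum keeps all three columns, while keep_na drops the second one, so the is.na-based test is the one that is safe to use as a column index.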

With the is.na-based test as the column index, kmeans runs:

titles.sf <- titles.sc[, colSums(is.na(titles.sc)) == 0, drop = FALSE]
kmeans(titles.sf, 2)
# K-means clustering with 2 clusters of sizes 2, 3
# ...
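The question does not show where the NAs come from. One plausible source, offered as an assumption and not as a diagnosis, is scale() itself: it divides each column by its standard deviation, so a document column whose counts are all identical (for example, all zero because it contains none of the frequent terms) becomes NaN. Note that na.omit() is applied before scale() in the question, so it cannot catch these. The sketch below uses a hypothetical toy matrix counts to illustrate the effect and one way to avoid it.

```r
# Hypothetical toy counts: rows are terms, columns are documents.
# Column 2 is constant (1, 1, 1), so its standard deviation is 0.
counts <- matrix(c(1, 0, 2,
                   1, 1, 1,
                   0, 3, 1), nrow = 3)

# scale() centers each column and divides it by its sd; 0 / 0 gives NaN
scaled <- scale(counts)
colSums(is.na(scaled))   # number of NaN entries per column

# Drop zero-variance columns before scaling instead of patching up afterwards
has_variance <- apply(counts, 2, sd) > 0
scaled_ok <- scale(counts[, has_variance, drop = FALSE])
anyNA(scaled_ok)
```

If this is what happens in your data, filtering on the unscaled matrix (as with has_variance above) removes the problematic documents before kmeans ever sees them.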