Tags: r, cluster-analysis, term-document-matrix

How to remove NA columns from a TDM for clustering


I'm struggling with NA values in a term-document matrix (TDM) that I want to cluster. Initially I set:

titles.tdm <- as.matrix(TermDocumentMatrix(titles.cw, control = list(bounds = list(global = c(10,Inf)))))

titles.sc <- scale(na.omit(titles.tdm))

and got a matrix of 418 terms and 6955 documents. At this point, executing titles.km <- kmeans(titles.sc, 2) throws: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

I then tried to remove those values with:

titles.sf <- titles.sc[,colSums(titles.sc) > 0]

I got a matrix of 4695 documents, but kmeans still throws the same error. When I inspect titles.sf, there are still columns (documents) with NA values. I'm confused and don't know what I'm doing wrong. How can I remove those documents?

Earlier, I applied titles.cw <- titles.cc[which(str_trim(titles.cc$content) != "")], where titles.cc is a plain Corpus object from the tm library, to delete blank documents. It probably worked, but my NA values are in documents that are definitely not blank.


Solution

  • Here's some example data:

    set.seed(123)
    titles.sc <- matrix(1:25,5,5)
    titles.sc[sample(length(titles.sc),5)]<-NA 
    titles.sc
    #      [,1] [,2] [,3] [,4] [,5]
    # [1,]    1    6   11   16   21
    # [2,]    2    7   12   17   NA
    # [3,]    3   NA   13   18   23
    # [4,]    4    9   14   NA   24
    # [5,]    5   NA   15   NA   25
    

    kmeans throws your error on this data

    kmeans(titles.sc, 2)
    # Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
    

    because your column subsetting probably does not do what you'd expect:

    colSums(titles.sc) > 0
    # [1] TRUE   NA TRUE   NA   NA
    

    colSums returns NA for any column that contains a missing value unless na.rm = TRUE is set (see ?colSums), and indexing with an NA yields an all-NA column rather than dropping it. Among other things, you could do

    colSums(is.na(titles.sc)) == 0
    # [1]  TRUE FALSE  TRUE FALSE FALSE
    

    or

    !is.na(colSums(titles.sc) > 0)
    # [1]  TRUE FALSE  TRUE FALSE FALSE
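
    A quick sketch of why the original subset left NA columns behind, using a hypothetical toy matrix m (not taken from the question): when the logical index contains NA, R returns a column filled with NA for that position instead of dropping it.

```r
# Hypothetical 2 x 3 matrix: only column 2 contains a missing value.
m <- matrix(c(1, 2, 3, NA, 5, 6), nrow = 2)

# colSums() gives NA for column 2, so the comparison is TRUE NA TRUE.
keep <- colSums(m) > 0

# Indexing with that vector keeps 3 columns; the NA position becomes all NA.
bad <- m[, keep, drop = FALSE]

# Counting missing values per column gives a clean TRUE/FALSE index.
good <- m[, colSums(is.na(m)) == 0, drop = FALSE]
```

    Here bad still has three columns, with the second one entirely NA, while good contains only the two complete columns.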
    

    And now, it works:

    titles.sf <- titles.sc[, colSums(is.na(titles.sc)) == 0, drop = FALSE]
    kmeans(titles.sf,2)
    # K-means clustering with 2 clusters of sizes 2, 3
    # ...
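
    One possible source of the NA values in the question, stated as an assumption because the original data is not shown: scale() divides each column by its standard deviation, so a constant column (for example a document that contains none of the retained terms) turns into NaN. Calling na.omit() before scale() cannot catch this, since the values do not exist yet at that point. A minimal sketch with a hypothetical matrix x:

```r
# Hypothetical data: column b is constant, so its standard deviation is 0.
x <- cbind(a = c(1, 2, 3), b = c(0, 0, 0), c = c(4, 6, 5))

# scale() centers each column and divides by its sd; column b becomes 0 / 0.
x.sc <- scale(x)
anyNA(x.sc)   # TRUE

# Dropping zero-variance columns before scaling avoids creating NaN at all.
x.ok <- scale(x[, apply(x, 2, sd) > 0, drop = FALSE])
anyNA(x.ok)   # FALSE
```

    If this is what happens in your data, filtering before scale() and filtering afterwards with colSums(is.na(...)) == 0 should remove the same columns.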