r, normalization, k-means, hierarchical-clustering

I am working with a document-term matrix (DTM) and I want to do k-means, hierarchical, and k-medoids clustering. Am I supposed to normalize the DTM first?


The data, AllBooks, has 590 observations of 8266 variables. Here is the code I have:

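library(readr)  # provides read_csv()

AllBooks = read_csv("AllBooks_baseline_DTM_Unlabelled.csv")
dtms = as.matrix(AllBooks)
# mean count per term for each document (row sum / number of terms, here 8266)
dtms_freq = as.matrix(rowSums(dtms) / ncol(dtms))
dtms_freq1 = dtms_freq[order(dtms_freq),]
# named *_val to avoid reusing the names of the base functions sd() and mean()
sd_val = sd(dtms_freq)
mean_val = mean(dtms_freq)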

This gives a mean of 0.01242767 and a standard deviation of 0.01305608.

Since the standard deviation is low, I take this to mean that the documents vary little in size. Does that mean I do not need to normalize the DTM? By normalize, I mean using the scale function in R, which subtracts each column's mean and divides by each column's standard deviation.
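
For reference, here is a minimal sketch of what scale() does, using a made-up 3 x 2 count matrix (the name toy and its values are hypothetical, not taken from AllBooks). Note that scale() works on each column (term), not on each row (document):

```r
# Hypothetical toy DTM: 3 documents (rows) x 2 terms (columns)
toy <- matrix(c(0, 2, 4,
                1, 1, 4),
              nrow = 3,
              dimnames = list(paste0("doc", 1:3), c("term_a", "term_b")))

# scale() subtracts each column's mean and divides by each column's sd
toy_scaled <- scale(toy)

# The same computation done by hand, column by column
manual <- sweep(sweep(toy, 2, colMeans(toy), "-"), 2, apply(toy, 2, sd), "/")

print(toy_scaled)
print(all.equal(as.vector(toy_scaled), as.vector(manual)))
```

After scaling, every term has mean 0 and standard deviation 1, so rare and frequent terms contribute on the same footing to a Euclidean distance.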

In other words, my big question is: when am I supposed to standardize data (specifically a document-term matrix) for clustering purposes?

Here is a small sample of the data:

dput(head(AllBooks,10))
budding = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), enjoyer = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), needs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), sittest = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), eclipsed = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), engagement = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    exuberant = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), abandons = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), well = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), cheerfulness = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    hatest = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), state = c(0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0), stained = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), production = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), whitened = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), revered = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), developed = c(0, 0, 0, 2, 0, 0, 0, 0, 0, 0), 
    regarded = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), enactments = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), aromatical = c(0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0), admireth = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    ), foothold = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), shots = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), turner = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), inversion = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    lifeless = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), postponement = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), stout = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), taketh = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), kettle = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), erred = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), thinkest = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), modern = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), reigned = c(0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0), sparingly = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    visual = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), thoughts = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), illumines = c(0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0), attire = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
    explains = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L))

The full data set is available here: https://www.dropbox.com/s/p9v1y6oxith1prh/AllBooks_baseline_DTM_Unlabelled.csv?dl=0


Solution

  • You have a sparse dataset, dominated by zeros, which is why the standard deviation is very low. Scaling can help if some of your non-zero counts are extremely large, e.g. some are in the 100s while others are 1s and 2s.
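
    A quick way to check both points is to measure the fraction of zero cells and to look at the spread of the non-zero counts. A minimal sketch on a made-up matrix (toy and its values are hypothetical, chosen only to illustrate the calculation):

    ```r
    # Hypothetical toy DTM: 4 documents (rows) x 5 terms (columns), mostly zeros
    toy <- matrix(c(0, 0, 0, 1,
                    0, 0, 0, 0,
                    2, 0, 0, 0,
                    0, 0, 0, 0,
                    0, 0, 1, 0),
                  nrow = 4)

    sparsity <- mean(toy == 0)   # fraction of cells that are zero
    nonzero  <- toy[toy > 0]     # the non-zero counts only

    print(sparsity)
    print(summary(nonzero))      # a very wide range here would argue for scaling
    ```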

    It might not be a good idea to use k-means on sparse data, because you are unlikely to find meaningful centers. There are a few alternatives, such as reducing the dimension first. There are also graph-based approaches, such as those used in biology.
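
    One way to read the dimension-reduction suggestion is to run k-means on a handful of principal components rather than on the raw sparse counts. A minimal sketch on simulated data (the toy matrix, the 5 components, and the 2 centers are all assumptions for illustration, not tuned values for AllBooks):

    ```r
    set.seed(42)  # reproducible toy data and k-means starts

    # Hypothetical toy counts: 60 documents x 50 terms. Documents 1-30 favour
    # terms 1-25, documents 31-60 favour terms 26-50.
    group <- rep(1:2, each = 30)
    rates <- ifelse(outer(group, 1:50, function(g, j) (g == 1) == (j <= 25)), 2, 0.1)
    toy <- matrix(rpois(length(rates), rates), nrow = nrow(rates))

    # Reduce to a few principal components, then cluster in that space
    pca <- prcomp(toy, center = TRUE)
    pcs <- pca$x[, 1:5]
    km <- kmeans(pcs, centers = 2, nstart = 10)

    # Cross-tabulate the simulated groups against the k-means clusters
    tab <- table(group, km$cluster)
    print(tab)
    ```

    On real data the number of components and the number of centers would need to be chosen by inspection, for example from the variance explained by each component.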

    Below is a simple way to cluster and visualize:

    x = read.csv("AllBooks_baseline_DTM_Unlabelled.csv")
    # drop empty documents (rows) and terms that occur in at most one document (columns)
    x = x[rowMeans(x)>0,colSums(x>0)>1]
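
    To see what this filter keeps and drops, here is a minimal sketch on a made-up data frame (toy, its term names, and its counts are hypothetical):

    ```r
    # Hypothetical toy DTM: 4 documents x 3 terms
    toy <- data.frame(rare = c(0, 3, 0, 0),   # occurs in one document only
                      god  = c(1, 0, 0, 2),
                      lord = c(2, 1, 0, 0),
                      row.names = paste0("doc", 1:4))

    keep_docs  <- rowMeans(toy) > 0      # FALSE for doc3, which has no terms
    keep_terms <- colSums(toy > 0) > 1   # FALSE for rare, seen in a single document

    filtered <- toy[keep_docs, keep_terms]
    print(filtered)
    ```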
    

    Treat the counts as binary (term present or absent) and run hierarchical clustering on a binary distance:

    hc=hclust(dist(x,method="binary"),method="ward.D")
    clus = cutree(hc,5)
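
    For reference, the "binary" method of dist() treats every non-zero count as present and returns, for each pair of documents, the share of terms present in only one of them among the terms present in at least one. A minimal sketch with two made-up documents (the values are hypothetical):

    ```r
    # Hypothetical toy DTM: 2 documents x 4 terms
    toy <- rbind(doc1 = c(1, 0, 2, 0),
                 doc2 = c(0, 0, 3, 1))

    d <- dist(toy, method = "binary")

    # The same distance by hand
    on <- toy > 0
    by_hand <- sum(xor(on[1, ], on[2, ])) / sum(on[1, ] | on[2, ])

    print(d)        # 2/3: terms 1 and 4 are in one document only, terms 1, 3, 4 are in at least one
    print(by_hand)
    ```

    The actual counts (2 versus 3 for the third term) play no role, which is why this distance is insensitive to document length.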
    

    Calculate PCA, run t-SNE on the first 30 principal components, and visualize:

    library(Rtsne)
    library(ggplot2)
    
    pca = prcomp(x,scale.=TRUE,center=TRUE)
    # t-SNE on the first 30 principal components
    TS = Rtsne(pca$x[,1:30])
    ggplot(data.frame(Dim1=TS$Y[,1],Dim2=TS$Y[,2],C=factor(clus)),
    aes(x=Dim1,y=Dim2,col=C))+geom_point()
    

    [Image: t-SNE scatter plot of the documents (Dim1 vs Dim2), colored by cluster]

    Cluster 5 seems to be very different from the rest. The 10 words whose mean counts are most elevated in cluster 5 relative to the other clusters are:

    names(tail(sort(colMeans(x[clus==5,]) - colMeans(x[clus!=5,])),10))
     [1] "wisdom" "thee"   "lord"   "things" "god"    "hath"   "thou"   "man"   
     [9] "thy"    "shall"
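
    The one-liner above can be unpacked step by step. A minimal sketch on a made-up data frame with made-up cluster labels (toy, toy_clus, and all counts are hypothetical):

    ```r
    # Hypothetical toy DTM: 4 documents x 3 terms, with assumed cluster labels
    toy <- data.frame(thou  = c(3, 2, 0, 0),
                      shall = c(2, 2, 1, 0),
                      state = c(0, 0, 2, 3))
    toy_clus <- c(1, 1, 2, 2)

    # Mean count per term inside cluster 1 minus the mean count outside it
    diff_in_means <- colMeans(toy[toy_clus == 1, ]) - colMeans(toy[toy_clus != 1, ])

    # Sort ascending and keep the last entries: the terms most over-represented
    # in cluster 1
    top_terms <- names(tail(sort(diff_in_means), 2))
    print(diff_in_means)
    print(top_terms)
    ```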