Search code examples
rcluster-analysisk-means

Sometimes a NA is returned when stuying the sd of clusters, sometimes none


I have a serie of observations that I cluster using kmeans. Then, I investigate the standard deviation (sd) inside each clusters and get the max.

If I run several time the same code, sometimes, a NA appears.

Reducing k_n (number of clusters) produce the same effect, increasing is worse; with/without na.rm=T doesn't change.

Can someone explain me what I'm doing wrong, please ?

The code:

k_n <-11
clusters=as.data.frame(kmeans(ex,k_n, nstart=50,iter.max = 15 )$cluster)
clusters<-cbind(clusters,ex)
temp<-sapply(1:k_n, function(k){temp=subset(clusters, clusters[,1]==k)
                                sd<-sd(temp[,2], na.rm = T)
                                return(sd)})
max(temp)
temp

Here are the results of three runs. As you can see, the third trial returns a NA, the two others don't.

Results of three runs

And here are the data "ex":

1400 1400 2000 2000 2000 2001 1400 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 1401 2000 2000 2000 2000 1401 1401 2000 2000 2000 2000 2000 2000 2000 2000 801 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 2000 2000 2000 1400 1400 2000 2000 2000 1200 1600 1500 1960 1350 1900 1900 1900 1350 2000 2000 2000 2000 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 1900 1350 1900 1900 1350 1350 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1920 1350 1350 1900 1900 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 1900 1350 1900 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1901 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 2000 1350 1900 1900 1900 1350 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1400 1350 1350 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 0 0 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1500 1500 1500 1400 1901 1901 1901 1901 2000 1350 1900 1900 1900 1350 1350 2000 2000 2000 2000 1500 1560 1900 1900 1900 1900 1900 1900 1400 2000 1350 1900 1900 1900 1350 2000 2000 2000 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 2000 1900 1900 1900 1900 1900 1900 1900 1900 2000 1350 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1900 1900 1350 1350 1350 1350 1350 1900 2000 1350 1900 1900 1900 1900 1900 1900 1350 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1900 1900 801 1400 1400 1400 1400 1901 1901 1900 1901 1901 1900 1901 1901 1901 1900 1900 1900 1900 1901 1900 1900 1901 1901 1900 1901 1901 1901 1350 1350 1350 1350 1350 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 1401 1401 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1100 1100 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1800 2000 1800 1800 1400 1200 1400 1600 1800 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000


Solution

  • You get NA for sd because there is only one member in the cluster.

    Using your example:

    set.seed(1)
    clus = kmeans(ex,k_n, nstart=50,iter.max = 15 )
    

    Using your code:

    clusters=as.data.frame(clus$cluster)
    clusters<-cbind(clusters,ex)
    temp<-sapply(1:k_n, function(k){temp=subset(clusters, clusters[,1]==k)
                                    sd<-sd(temp[,2], na.rm = T)
                                    return(sd)})
    
    > temp
     [1]  0.40191848  4.10391341  0.00000000  0.06097108  1.57499631 12.59800291
     [7]  0.00000000  0.00000000          NA  0.00000000  0.00000000
    

    One of them is NA, and if you look at which cluster is giving you NA, this cluster has only one member:

    clusters[clusters[,1]==which(is.na(temp)),]
       clus$cluster   ex
    69            9 1960
    

    If we look at your data:

    table(ex)
    ex
       0  801 1100 1200 1350 1400 1401 1500 1560 1600 1800 1900 1901 1920 1960 2000 
       2    2    2  123   36   21    5   22    1   94    4  147   18    1    1  268 
    2001 
       1
    

    I think it's possible if you increase k, you might end up with one of the clusters having only 1 member which allows convergence.

    One way I can suggest is to increase the number of starts:

    STARTS = seq(50,500,by=50)
    # we test over 50 reps, how many single clusters we get
    n_equal_one = sapply(STARTS,function(S){
    replicate(50,sum(kmeans(ex,k_n, nstart=S,iter.max = 15 )$size==1))
    })
    
    plot(STARTS,colMeans(n_equal_one),ylab="Average proportion of singleton cluster")
    

    enter image description here

    So if you try nstart = 400 or 500, you will avoid singleton (cluster with n=1), but if you data becomes more sparse, it might be unavoidable..

    dput(ex)
    
    c(1400, 1400, 2000, 2000, 2000, 2001, 1400, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1400, 1400, 1401, 
    2000, 2000, 2000, 2000, 1401, 1401, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 801, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 1400, 1400, 2000, 2000, 2000, 1400, 1400, 2000, 2000, 2000, 
    1200, 1600, 1500, 1960, 1350, 1900, 1900, 1900, 1350, 2000, 2000, 
    2000, 2000, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
    1200, 2000, 2000, 2000, 2000, 2000, 2000, 1900, 1350, 1900, 1900, 
    1350, 1350, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1200, 1200, 
    1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1920, 1350, 1350, 
    1900, 1900, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 
    1200, 1200, 1200, 1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    1900, 1350, 1900, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
    1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    1900, 1901, 2000, 2000, 2000, 2000, 1200, 1200, 1200, 1200, 1200, 
    1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
    1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1500, 1500, 1500, 1500, 2000, 1350, 1900, 1900, 1900, 1350, 
    2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1500, 1500, 1400, 
    1350, 1350, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 
    2000, 1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 
    1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 2000, 2000, 
    2000, 2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1900, 1900, 1900, 1500, 1500, 1500, 1500, 0, 0, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1500, 1500, 1500, 2000, 2000, 2000, 2000, 2000, 1200, 
    1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 
    2000, 2000, 2000, 2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 
    1500, 1500, 1500, 1400, 1901, 1901, 1901, 1901, 2000, 1350, 1900, 
    1900, 1900, 1350, 1350, 2000, 2000, 2000, 2000, 1500, 1560, 1900, 
    1900, 1900, 1900, 1900, 1900, 1400, 2000, 1350, 1900, 1900, 1900, 
    1350, 2000, 2000, 2000, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1400, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 
    1900, 1900, 1900, 1900, 1900, 1900, 1500, 1500, 1500, 1500, 2000, 
    1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 2000, 1350, 1900, 
    1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1600, 
    1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 
    1600, 1900, 1900, 1350, 1350, 1350, 1350, 1350, 1900, 2000, 1350, 
    1900, 1900, 1900, 1900, 1900, 1900, 1350, 1350, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 1200, 1200, 1200, 1200, 
    1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 
    1600, 1600, 1600, 1600, 1600, 1900, 1900, 801, 1400, 1400, 1400, 
    1400, 1901, 1901, 1900, 1901, 1901, 1900, 1901, 1901, 1901, 1900, 
    1900, 1900, 1900, 1901, 1900, 1900, 1901, 1901, 1900, 1901, 1901, 
    1901, 1350, 1350, 1350, 1350, 1350, 1350, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 1400, 1400, 1401, 1401, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1100, 1100, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 1800, 2000, 1800, 1800, 1400, 1200, 1400, 
    1600, 1800, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 
    2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
    )