I have a serie of observations that I cluster using kmeans. Then, I investigate the standard deviation (sd) inside each clusters and get the max.
If I run several time the same code, sometimes, a NA appears.
Reducing k_n (number of clusters) produce the same effect, increasing is worse; with/without na.rm=T
doesn't change.
Can someone explain me what I'm doing wrong, please ?
The code:
k_n <-11
clusters=as.data.frame(kmeans(ex,k_n, nstart=50,iter.max = 15 )$cluster)
clusters<-cbind(clusters,ex)
temp<-sapply(1:k_n, function(k){temp=subset(clusters, clusters[,1]==k)
sd<-sd(temp[,2], na.rm = T)
return(sd)})
max(temp)
temp
Here are the results of three runs. As you can see, the third trial returns a NA, the two others don't.
And here are the data "ex":
1400 1400 2000 2000 2000 2001 1400 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 1401 2000 2000 2000 2000 1401 1401 2000 2000 2000 2000 2000 2000 2000 2000 801 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 2000 2000 2000 1400 1400 2000 2000 2000 1200 1600 1500 1960 1350 1900 1900 1900 1350 2000 2000 2000 2000 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 1900 1350 1900 1900 1350 1350 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1920 1350 1350 1900 1900 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 1900 1350 1900 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1901 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 2000 1350 1900 1900 1900 1350 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1400 1350 1350 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 0 0 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 1900 1900 1900 1900 1900 1900 1500 1500 1500 1400 1901 1901 1901 1901 2000 1350 1900 1900 1900 1350 1350 2000 2000 2000 2000 1500 1560 1900 1900 1900 1900 1900 1900 1400 2000 1350 1900 1900 1900 1350 2000 2000 2000 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1200 1200 1200 1200 1200 1200 1200 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 1500 1500 1500 1500 2000 1900 1900 1900 1900 1900 1900 1900 1900 2000 1350 1900 1900 1900 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1900 1900 1350 1350 1350 1350 1350 1900 2000 1350 1900 1900 1900 1900 1900 1900 1350 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1200 1200 1200 1200 1200 1200 1200 1200 1200 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 1900 1900 801 1400 1400 1400 1400 1901 1901 1900 1901 1901 1900 1901 1901 1901 1900 1900 1900 1900 1901 1900 1900 1901 1901 1900 1901 1901 1901 1350 1350 1350 1350 1350 1350 2000 2000 2000 2000 2000 2000 2000 2000 2000 1400 1400 1401 1401 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1100 1100 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 1800 2000 1800 1800 1400 1200 1400 1600 1800 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000
You get NA for sd because there is only one member in the cluster.
Using your example:
set.seed(1)
clus = kmeans(ex,k_n, nstart=50,iter.max = 15 )
Using your code:
clusters=as.data.frame(clus$cluster)
clusters<-cbind(clusters,ex)
temp<-sapply(1:k_n, function(k){temp=subset(clusters, clusters[,1]==k)
sd<-sd(temp[,2], na.rm = T)
return(sd)})
> temp
[1] 0.40191848 4.10391341 0.00000000 0.06097108 1.57499631 12.59800291
[7] 0.00000000 0.00000000 NA 0.00000000 0.00000000
One of them is NA, and if you look at which cluster is giving you NA, this cluster has only one member:
clusters[clusters[,1]==which(is.na(temp)),]
clus$cluster ex
69 9 1960
If we look at your data:
table(ex)
ex
0 801 1100 1200 1350 1400 1401 1500 1560 1600 1800 1900 1901 1920 1960 2000
2 2 2 123 36 21 5 22 1 94 4 147 18 1 1 268
2001
1
I think it's possible if you increase k, you might end up with one of the clusters having only 1 member which allows convergence.
One way I can suggest is to increase the number of starts:
STARTS = seq(50,500,by=50)
# we test over 50 reps, how many single clusters we get
n_equal_one = sapply(STARTS,function(S){
replicate(50,sum(kmeans(ex,k_n, nstart=S,iter.max = 15 )$size==1))
})
plot(STARTS,colMeans(n_equal_one),ylab="Average proportion of singleton cluster")
So if you try nstart = 400 or 500, you will avoid singleton (cluster with n=1), but if you data becomes more sparse, it might be unavoidable..
dput(ex)
c(1400, 1400, 2000, 2000, 2000, 2001, 1400, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1400, 1400, 1401,
2000, 2000, 2000, 2000, 1401, 1401, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 801, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 1400, 1400, 2000, 2000, 2000, 1400, 1400, 2000, 2000, 2000,
1200, 1600, 1500, 1960, 1350, 1900, 1900, 1900, 1350, 2000, 2000,
2000, 2000, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200,
1200, 2000, 2000, 2000, 2000, 2000, 2000, 1900, 1350, 1900, 1900,
1350, 1350, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1200, 1200,
1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1920, 1350, 1350,
1900, 1900, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200,
1200, 1200, 1200, 1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
1900, 1350, 1900, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 1200, 1200, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200,
1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
1900, 1901, 2000, 2000, 2000, 2000, 1200, 1200, 1200, 1200, 1200,
1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200,
1200, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1900,
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900, 1500, 1500, 1500, 1500, 2000, 1350, 1900, 1900, 1900, 1350,
2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1500, 1500, 1400,
1350, 1350, 1900, 1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000,
2000, 1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600,
1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000, 2000, 2000,
2000, 2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900, 1900, 1900, 1900, 1900, 1500, 1500, 1500, 1500, 0, 0, 1900,
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900, 1900, 1500, 1500, 1500, 2000, 2000, 2000, 2000, 2000, 1200,
1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200,
2000, 2000, 2000, 2000, 2000, 1900, 1900, 1900, 1900, 1900, 1900,
1500, 1500, 1500, 1400, 1901, 1901, 1901, 1901, 2000, 1350, 1900,
1900, 1900, 1350, 1350, 2000, 2000, 2000, 2000, 1500, 1560, 1900,
1900, 1900, 1900, 1900, 1900, 1400, 2000, 1350, 1900, 1900, 1900,
1350, 2000, 2000, 2000, 1200, 1200, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1400, 1900,
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900, 1900, 1900, 1900, 1900, 1900, 1500, 1500, 1500, 1500, 2000,
1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 2000, 1350, 1900,
1900, 1900, 1350, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1200, 1600,
1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600,
1600, 1900, 1900, 1350, 1350, 1350, 1350, 1350, 1900, 2000, 1350,
1900, 1900, 1900, 1900, 1900, 1900, 1350, 1350, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 1200, 1200, 1200, 1200,
1200, 1200, 1200, 1200, 1200, 1600, 1600, 1600, 1600, 1600, 1600,
1600, 1600, 1600, 1600, 1600, 1900, 1900, 801, 1400, 1400, 1400,
1400, 1901, 1901, 1900, 1901, 1901, 1900, 1901, 1901, 1901, 1900,
1900, 1900, 1900, 1901, 1900, 1900, 1901, 1901, 1900, 1901, 1901,
1901, 1350, 1350, 1350, 1350, 1350, 1350, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 1400, 1400, 1401, 1401, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 1100, 1100,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 1800, 2000, 1800, 1800, 1400, 1200, 1400,
1600, 1800, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000
)