Search code examples
rcluster-analysishierarchical-clustering

Sparse clustering using the sparcl package in R


I am using the sparcl package written by Witten and Tibshirani based on their paper:

Witten DM and R Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726

I look into the example under the function HierarchicalSparseCluster:

# Generate 2-class data
set.seed(1)
x <- matrix(rnorm(100*50),ncol=50)
y <- c(rep(1,50),rep(2,50))
x[y==1,1:25] <- x[y==1,1:25]+2

# Do tuning parameter selection for sparse hierarchical clustering
perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:6),nperms=5)

# Perform sparse hierarchical clustering
sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists, wbound=perm.out$bestw, method="complete")

Now I check dim(sparsehc$dists) and it returns 4950 and 50. From the simulation set-up, we know that n=100 and p=50. Also, according to the manual, the returned value dists is a (n*n)xp dissimilarity matrix for the data matrix x. Obviously the row dimension is not n*n as it should be 100*100=10000 instead of 4950. Did I misunderstand something? Thank you very much!


Solution

  • It seems to be the mistake in sparcl help page: dimensions of dissimilarity matrix dist are n2xp, where n2=n*(n-1)/2. Indeed, we don't need nxn matrix of distances, but only part of this matrix over the main diagonal.

    Sources of sparcl support what I said above:

    distfun.R

    distfun=function(x){
    #if(!is.loaded("distfun")){
    #  dyn.load("distfun.so")
    #}
    n<-nrow(x)
    p <- ncol(x)
    x[is.na(x)]=0
    mode(x)="single"
    n2=n*(n-1)/2
    junk=.Fortran("distfun",
             x,
            as.integer(n),
           as.integer(p),
           as.integer(n2),
           d=single(n2*p), PACKAGE="sparcl"
    )
    return(junk$d)
    }
    

    Here we can see how n2 is calculated and passed to Fortran function.

    distfun.f

    C Output from Public domain Ratfor, version 1.0
          subroutine distfun(x,n,p,n2,d)
          implicit double precision (a-h,o-z)
          integer n,p,n2
          real x(n,p),d(n2,p)
          ii=0
          do23000 i=1,n-1
          do23002 ip=i+1,n
          ii=ii+1
          do23004 j=1,p
          d(ii,j)=abs(x(i,j)-x(ip,j))
    23004 continue
    23005 continue
    23002 continue
    23003 continue
    23000 continue
    23001 continue
          return
          end
    

    Here for each feature in dist matrix there is a column of size n2 constructed, that holds a sequence of pairwise distances between objects. For example, for n=4, p=2 and n2=4*3/2=6 the final matrix will be 6x2 and designed like this:

        |    1     |     2    |
    ---------------------------
      1 | d(1,2)_1 | d(1,2)_2 |
      2 | d(1,3)_1 | d(1,3)_2 |
      3 | d(1,4)_1 | d(1,4)_2 |
      4 | d(2,3)_1 | d(2,3)_2 |
      5 | d(2,4)_1 | d(2,4)_2 |
      6 | d(3,4)_1 | d(3,4)_2 |
    

    Where, say, d(2,4)_1 is a distance between 2nd and 4th object for 1st feature.