Search code examples
rcluster-analysishierarchicalrscript

Is there a better way to hierarchically cluster in R?


I would like to do hierarchical clustering by row and then by column. I came up with this total hack of a solution:

#! /path/to/my/Rscript --vanilla
args <- commandArgs(TRUE)
mtxf.in <- args[1]
clusterMethod <- args[2]
mtxf.out <- args[3]

mtx <- read.table(mtxf.in, as.is=T, header=T, stringsAsFactors=T)

mtx.hc <- hclust(dist(mtx), method=clusterMethod)
mtx.clustered <- as.data.frame(mtx[mtx.hc$order,])
mtx.c.colnames <- colnames(mtx.clustered)
rownames(mtx.clustered) <- mtx.clustered$topLeftColumnHeaderName
mtx.clustered$topLeftColumnHeaderName <- NULL
mtx.c.t <- as.data.frame(t(mtx.clustered), row.names=names(mtx))
mtx.c.t.hc <- hclust(dist(mtx.c.t), method=clusterMethod)
mtx.c.t.c <- as.data.frame(mtx.c.t[mtx.c.t.hc$order,])
mtx.c.t.c.t <- as.data.frame(t(mtx.c.t.c))
mtx.c.t.c.t.colnames <- as.vector(names(mtx.c.t.c.t))
names(mtx.c.t.c.t) <- mtx.c.colnames[as.numeric(mtx.c.t.c.t.colnames) + 1]

write.table(mtx.c.t.c.t, file=mtxf.out, sep='\t', quote=F, row.names=T)

The variables mtxf.in and mtxf.out represent the input matrix and clustered output matrix files, respectively. The variable clusterMethod is one of the hclust methods, such as single, average, etc.

As an example input, here's a data matrix:

topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
row1    0       3       0       0       0       3
row2    6       6       6       6       6       6
row3    0       3       0       0       0       3
row4    6       6       6       6       6       6
row5    0       3       0       0       0       3
row6    0       3       0       0       0       3

Running this script, I lose my top-left corner element from mtxf.in. Here's the output that comes out of this script:

col5    col4    col1    col3    col2    col6
row6    0       0       0       0       3       3
row5    0       0       0       0       3       3
row1    0       0       0       0       3       3
row3    0       0       0       0       3       3
row2    6       6       6       6       6       6
row4    6       6       6       6       6       6

My questions: In addition to looking for a way to preserve the original structure of the input matrix file, I also don't know how much memory this consumes or whether there are faster and cleaner, more "R"-like ways for doing this.

Is it really this hard to cluster by rows and columns in R? Are there constructive ways to improve this script? Thanks for your advice.


Solution

  • Once you have your data cleaned (i.e. removed the first column), this really just requires three lines of code:

    Clean data (assign row names from first column, then remove first column):

    dat <- mtfx.in
    rownames(dat) <- dat[, 1]
    dat <- dat[, -1]
    

    Cluster and reorder:

    row.order <- hclust(dist(dat))$order
    col.order <- hclust(dist(t(dat)))$order
    
    dat[row.order, col.order]
    

    Results:

         col5 col4 col1 col3 col2 col6
    row6    0    0    0    0    3    3
    row5    0    0    0    0    3    3
    row1    0    0    0    0    3    3
    row3    0    0    0    0    3    3
    row2    6    6    6    6    6    6
    row4    6    6    6    6    6    6