I would like to do hierarchical clustering by row and then by column. I came up with this total hack of a solution:
#! /path/to/my/Rscript --vanilla
args <- commandArgs(TRUE)
mtxf.in <- args[1]
clusterMethod <- args[2]
mtxf.out <- args[3]
mtx <- read.table(mtxf.in, as.is=T, header=T, stringsAsFactors=T)
mtx.hc <- hclust(dist(mtx), method=clusterMethod)
mtx.clustered <- as.data.frame(mtx[mtx.hc$order,])
mtx.c.colnames <- colnames(mtx.clustered)
rownames(mtx.clustered) <- mtx.clustered$topLeftColumnHeaderName
mtx.clustered$topLeftColumnHeaderName <- NULL
mtx.c.t <- as.data.frame(t(mtx.clustered), row.names=names(mtx))
mtx.c.t.hc <- hclust(dist(mtx.c.t), method=clusterMethod)
mtx.c.t.c <- as.data.frame(mtx.c.t[mtx.c.t.hc$order,])
mtx.c.t.c.t <- as.data.frame(t(mtx.c.t.c))
mtx.c.t.c.t.colnames <- as.vector(names(mtx.c.t.c.t))
names(mtx.c.t.c.t) <- mtx.c.colnames[as.numeric(mtx.c.t.c.t.colnames) + 1]
write.table(mtx.c.t.c.t, file=mtxf.out, sep='\t', quote=F, row.names=T)
The variables mtxf.in
and mtxf.out
represent the input matrix and clustered output matrix files, respectively. The variable clusterMethod
is one of the hclust
methods, such as single
, average
, etc.
As an example input, here's a data matrix:
topLeftColumnHeaderName col1 col2 col3 col4 col5 col6
row1 0 3 0 0 0 3
row2 6 6 6 6 6 6
row3 0 3 0 0 0 3
row4 6 6 6 6 6 6
row5 0 3 0 0 0 3
row6 0 3 0 0 0 3
Running this script, I lose my top-left corner element from mtxf.in
. Here's the output that comes out of this script:
col5 col4 col1 col3 col2 col6
row6 0 0 0 0 3 3
row5 0 0 0 0 3 3
row1 0 0 0 0 3 3
row3 0 0 0 0 3 3
row2 6 6 6 6 6 6
row4 6 6 6 6 6 6
My questions: In addition to looking for a way to preserve the original structure of the input matrix file, I also don't know how much memory this consumes or whether there are faster and cleaner, more "R"-like ways for doing this.
Is it really this hard to cluster by rows and columns in R? Are there constructive ways to improve this script? Thanks for your advice.
Once you have your data cleaned (i.e. removed the first column), this really just requires three lines of code:
Clean data (assign row names from first column, then remove first column):
dat <- mtfx.in
rownames(dat) <- dat[, 1]
dat <- dat[, -1]
Cluster and reorder:
row.order <- hclust(dist(dat))$order
col.order <- hclust(dist(t(dat)))$order
dat[row.order, col.order]
Results:
col5 col4 col1 col3 col2 col6
row6 0 0 0 0 3 3
row5 0 0 0 0 3 3
row1 0 0 0 0 3 3
row3 0 0 0 0 3 3
row2 6 6 6 6 6 6
row4 6 6 6 6 6 6