Search code examples
rsortingsumcol

Sort columns by column sums, identical columns adjacent


I have a dataset consisting of dichotomous values. It's a very large dataset, but here's an example:

var1 <- c(1, 0, 1, 1, 0)
var2 <- c(1, 1, 1, 1, 1)
var3 <- c(0, 0, 1, 1, 0)
var4 <- c(0, 0, 1, 1, 0)
var5 <- c(1, 1, 0, 0, 0)
dat <- data.frame(var1,var2,var3,var4,var5)
dat <- as.matrix(dat)

I'm trying to incorporate two commands into one. First, I want to cluster the columns such that identical columns are lumped together. Second, I want the columns to be ordered by column sums. I can do one or the other, but not both.

So, the output should look like this:

var2 var1 var5 var3 var4
   1    1    1    0    0
   1    0    1    0    0
   1    1    0    1    1
   1    1    0    1    1
   1    0    0    0    0

The highest column sums need not be located on the left.

I tried using this command:

 csums <- dat[,order(colSums(dat,na.rm=TRUE))]

But the columns do not cluster by similarity. Perhaps there's a way to cluster based on similarity, conditioned on column sum.


Solution

  • Here's a bit of a strange solution. You can secondarily order by a collapsed string representation of the columns, which will serve as a tiebreaker for column sets that have equal colSums(). This will ensure that identical columns are clustered together, as they will lexicographically sort next to each other.

    dat[,order(decreasing=T,colSums(dat,na.rm=T),apply(dat,2L,paste,collapse=''))];
    ##      var2 var1 var5 var3 var4
    ## [1,]    1    1    1    0    0
    ## [2,]    1    0    1    0    0
    ## [3,]    1    1    0    1    1
    ## [4,]    1    1    0    1    1
    ## [5,]    1    0    0    0    0