I have a dataset consisting of dichotomous values. It's a very large dataset, but here's an example:
var1 <- c(1, 0, 1, 1, 0)
var2 <- c(1, 1, 1, 1, 1)
var3 <- c(0, 0, 1, 1, 0)
var4 <- c(0, 0, 1, 1, 0)
var5 <- c(1, 1, 0, 0, 0)
dat <- data.frame(var1,var2,var3,var4,var5)
dat <- as.matrix(dat)
I'm trying to incorporate two commands into one. First, I want to cluster the columns such that identical columns are lumped together. Second, I want the columns to be ordered by column sums. I can do one or the other, but not both.
So, the output should look like this:
var2 var1 var5 var3 var4
1 1 1 0 0
1 0 1 0 0
1 1 0 1 1
1 1 0 1 1
1 0 0 0 0
The highest column sums need not be located on the left.
I tried using this command:
csums <- dat[,order(colSums(dat,na.rm=TRUE))]
But the columns do not cluster by similarity. Perhaps there's a way to cluster based on similarity, conditioned on column sum.
Here's a bit of a strange solution. You can secondarily order by a collapsed string representation of the columns, which will serve as a tiebreaker for column sets that have equal colSums()
. This will ensure that identical columns are clustered together, as they will lexicographically sort next to each other.
dat[,order(decreasing=T,colSums(dat,na.rm=T),apply(dat,2L,paste,collapse=''))];
## var2 var1 var5 var3 var4
## [1,] 1 1 1 0 0
## [2,] 1 0 1 0 0
## [3,] 1 1 0 1 1
## [4,] 1 1 0 1 1
## [5,] 1 0 0 0 0