I want to select columns from a data frame so that the resulting contiguous column sequences are as long as possible, while keeping the number of rows containing NAs as small as possible, because those rows have to be dropped afterwards.
(The reason I want to do this is that I want to run TraMineR::seqsubm() to automatically get a matrix of transition costs (by transition probability) and later run cluster::agnes() on it. TraMineR::seqsubm() doesn't like NA states, and running cluster::agnes() on a matrix with NA states doesn't necessarily make much sense.)
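For context, here is a minimal sketch of the downstream pipeline I have in mind, assuming the NA-free selection ends up in a data frame d2 of categorical states (d2 and the method choices are placeholders; the functions are from the TraMineR and cluster docs):
library(TraMineR)
library(cluster)

seqs <- seqdef(d2)                              # build a state-sequence object
sm   <- seqsubm(seqs, method = "TRATE")         # substitution costs from transition rates
dmat <- seqdist(seqs, method = "OM", sm = sm)   # optimal-matching distances
cl   <- agnes(as.dist(dmat), diss = TRUE, method = "ward")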
For that purpose I already wrote a working function that computes, in principle, all possible subsets (the power set) and checks them for NAs. It works well with this toy data d, which represents an 11x5 matrix (plus an id column):
> d
id X1 X2 X3 X4 X5
1 A 1 11 21 31 41
2 B 2 12 22 32 42
3 C 3 13 23 33 NA
4 D 4 14 24 34 NA
5 E 5 15 25 NA NA
6 F 6 16 26 NA NA
7 G 7 17 NA NA NA
8 H 8 18 NA NA NA
9 I 9 NA NA NA NA
10 J 10 NA NA NA NA
11 K NA NA NA NA NA
The problem now is that I actually want to apply the algorithm to survey data that would represent a 34235 x 17 matrix!
My code has been reviewed on Code Review, but it still cannot be applied to the real data. I am aware that with this approach the calculation would be huge: the power set of the rows alone has 2^34235 elements. (Presumably too huge for non-supercomputers?!)
Does anyone know a more suitable approach?
Here is the function, already enhanced by @minem from Code Review:
seqRank2 <- function(d, id = "id") {
  require(matrixStats)

  # change structure, convert to matrix
  ii <- as.character(d[, id])
  dm <- d
  dm[[id]] <- NULL
  dm <- as.matrix(dm)
  rownames(dm) <- ii

  # enumerate all non-empty subsets of s (the power set minus the empty set)
  your.powerset <- function(s) {
    l <- vector(mode = "list", length = 2^length(s))
    l[[1]] <- numeric()
    counter <- 1L
    for (x in 1L:length(s)) {
      for (subset in 1L:counter) {
        counter <- counter + 1L
        l[[counter]] <- c(l[[subset]], s[x])
      }
    }
    return(l[-1])
  }

  psr <- your.powerset(ii)            # all subsets of row ids
  psc <- your.powerset(colnames(dm))  # all subsets of column names

  # every row subset crossed with every column subset
  sss <- lapply(psr, function(x) {
    i <- ii %in% x
    lapply(psc, function(y) dm[i, y, drop = FALSE])
  })

  # for each submatrix: drop NA-containing columns, return NULL if nothing
  # survives, NA if no column had to be dropped
  cn <- sapply(sss, function(x)
    lapply(x, function(y) {
      if (ncol(y) == 1) {
        if (any(is.na(y))) return(NULL)
        return(y)
      }
      isna2 <- matrixStats::colAnyNAs(y)
      if (all(isna2)) return(NULL)
      if (sum(isna2) == 0) return(NA)
      r <- y[, !isna2, drop = FALSE]
      return(r)
    }))

  # record the dimensions of each surviving submatrix
  scr <- sapply(cn, nrow)
  scc <- sapply(cn, ncol)
  namesCN <- sapply(cn, function(x) paste0(colnames(x), collapse = ", "))
  names(scr) <- namesCN
  scr <- unlist(scr)
  names(scc) <- namesCN
  scc <- unlist(scc)

  # keep the best observation count per column sequence,
  # rank by sequence length, then by number of observations
  m <- t(rbind(n.obs = scr, sq.len = scc))
  ag <- aggregate(m, by = list(sequence = rownames(m)), max)
  ag <- ag[order(-ag$sq.len, -ag$n.obs), ]
  rownames(ag) <- NULL
  return(ag)
}
Yielding:
> seqRank2(d)
sequence n.obs sq.len
1 X1, X2, X3, X4 4 4
2 X1, X2, X3 6 3
3 X1, X2, X4 4 3
4 X1, X3, X4 4 3
5 X2, X3, X4 4 3
6 X1, X2 8 2
7 X1, X3 6 2
8 X2, X3 6 2
9 X1, X4 4 2
10 X2, X4 4 2
11 X3, X4 4 2
12 X1 10 1
13 X2 8 1
14 X3 6 1
15 X4 4 1
16 X5 2 1
> system.time(x <- seqRank2(d))
user system elapsed
1.93 0.14 2.93
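For intuition about where those seconds go: the inner helper enumerates every non-empty subset, so each additional row or column doubles the work. Copied out of the function and run standalone on just three column names it already yields 2^3 - 1 = 7 candidates:
your.powerset <- function(s) {
  l <- vector(mode = "list", length = 2^length(s))
  l[[1]] <- numeric()
  counter <- 1L
  for (x in 1L:length(s)) {
    for (subset in 1L:counter) {
      counter <- counter + 1L
      l[[counter]] <- c(l[[subset]], s[x])
    }
  }
  l[-1]
}
your.powerset(c("X1", "X2", "X3"))
# 7 subsets: "X1", "X2", c("X1","X2"), "X3", c("X1","X3"),
#            c("X2","X3"), c("X1","X2","X3")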
In this case I would choose X1, X2, X3, X4; X1, X2, X3; or X2, X3, X4, because they're contiguous and yield an appropriate number of observations.
Expected output:
So for the toy data d the expected output would be something like:
> seqRank2(d)
sequence n.obs sq.len
1 X1, X2, X3, X4 4 4
2 X1, X2, X3 6 3
3 X2, X3, X4 4 3
4 X1, X2 8 2
5 X2, X3 6 2
6 X3, X4 4 2
7 X1 10 1
8 X2 8 1
9 X3 6 1
10 X4 4 1
11 X5 2 1
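This expected output can be obtained from the actual output by post-filtering seqRank2()'s result for contiguous runs of column indices. A minimal sketch (the strsplit()-based test is my addition and assumes column names of the form X<number>):
ag <- seqRank2(d)
contiguous <- sapply(strsplit(ag$sequence, ", "), function(s) {
  idx <- as.integer(sub("X", "", s))  # "X2" -> 2, etc.
  all(diff(idx) == 1)                 # TRUE for runs like X2, X3, X4
})
ag[contiguous, ]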
And in the end the function should run properly on the huge matrix d.huge, which currently leads to errors:
> seqRank2(d.huge)
Error in vector(mode = "list", length = 2^length(s)) :
vector size cannot be infinite
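The error itself is just the 2^n blow-up surfacing: with 3e4 row ids the helper asks for a list of length 2^30000, which overflows double precision to Inf before any subset is computed:
2^3e4
# [1] Inf   -> vector(mode = "list", length = Inf) aborts
2^50
# about 1.13e15 -- even the column power set alone is far out of reach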
Toy data d:
d <- structure(list(id = structure(1:11, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), X1 = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, NA), X2 = c(11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, NA, NA, NA), X3 = c(21L, 22L, 23L, 24L,
25L, 26L, NA, NA, NA, NA, NA), X4 = c(31L, 32L, 33L, 34L, NA,
NA, NA, NA, NA, NA, NA), X5 = c(41L, 42L, NA, NA, NA, NA, NA,
NA, NA, NA, NA)), row.names = c(NA, -11L), class = "data.frame")
Toy data d.huge:
d.huge <- setNames(data.frame(matrix(1:15.3e5, 3e4, 51)),
c("id", paste0("X", 1:50)))
d.huge[, 41:51] <- lapply(d.huge[, 41:51], function(x){
x[which(x %in% sample(x, .05*length(x)))] <- NA
x
})
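A quick sanity check on the simulated data (exact counts vary between runs because no seed is set):
dim(d.huge)                      # 30000 rows, 51 columns
mean(complete.cases(d.huge))     # share of rows without any NA
colSums(is.na(d.huge[, 41:51]))  # the NAs are confined to the last 11 columns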
Appendix (see comments on the latest answer):
d.huge <- read.csv("d.huge.csv")
d.huge.1 <- d.huge[sample(nrow(d.huge), 3/4*nrow(d.huge)), ]
d1 <- seqRank3(d.huge.1, 1.27e-1, 1.780e1)
d2 <- d1[complete.cases(d1), ]
dim(d2)
names(d2)
This takes less than one second on the huge data:
# all contiguous column windows d[i:j] (column 1 is the id, hence 2:length(d))
l1 = combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
# If you also need "combinations" of only single columns, then uncomment the next line
# l1 = c(d[-1], l1)
# number of NA-free rows in each window
l2 = sapply(l1, function(x) sum(complete.cases(x)))
# score each window: width times number of complete rows
score = sapply(1:length(l1), function(i) NCOL(l1[[i]]) * l2[i])
best_score = which.max(score)
best = l1[[best_score]]
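For reference, a self-contained run on the toy data d (the wrapper name best_window is mine; the body is the snippet above):
best_window <- function(d) {
  l1 <- combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
  l2 <- sapply(l1, function(x) sum(complete.cases(x)))
  score <- sapply(seq_along(l1), function(i) NCOL(l1[[i]]) * l2[i])
  l1[[which.max(score)]]
}
names(best_window(d))
# [1] "X1" "X2" "X3"   (3 columns x 6 complete rows = 18 is the top score)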
The question was unclear about how the various combinations should be ranked. We can use different scoring formulae to express different preferences. For example, to weight the number of rows versus columns separately we can do:
col_weight = 2
row_weight = 1
score = sapply(1:length(l1), function(i) col_weight*NCOL(l1[[i]]) + row_weight * l2[i])
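To actually pick a window under the new score, reuse the last two lines from before; note that which.max() returns the first maximum, so ties go to the earliest window. On the toy data d these particular weights produce a four-way tie at score 12, won by X1, X2:
best_score = which.max(score)
best = l1[[best_score]]
names(best)
# [1] "X1" "X2"   (X1:X2, X1:X3, X1:X4 and X1:X5 all score 12)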