Search code examples
rmatrixsetmax

R: Find set of columns which contain most 1s in matrix of 0 and 1


I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.

I want to find which set of (in the example) 5 columns/events that cover the most rows/individuals.

Test Data

#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30

My attempt

My initial attempt was just based on combining the set of 5 columns with the highest colMeans:

#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)

#Have a look the set
col_set

>
      197       199        59        80        76 
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000 

#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5

>
[1] 0.7

However this set does not cover the most rows. I tested this by pseudo-random sampling 10.000 different sets of 5 columns, and then finding the set with the highest coverage:

#Get 5 random columns using colMeans as prob in sample
##Random sample 10.000 times
set.seed(123)
result <- lapply(1:10000, function(x){
  col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
  cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
  list(set = col_set2, cover = cover)
})

##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]

>
[[1]]
[[1]]$set
        59        169        262         68        197 
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667 

[[1]]$cover
[1] 0.7666667

The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.

So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.

EDIT

For the interested, I decided to microbenchmark the 3 solutions provided:

#Defining G. Grothendieck's coverage funciton outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

#G. Grothendieck top solution
solution1 <- function(d){
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#G. Grothendieck "Older solution"
solution2 <- function(d){
  require(lpSolve)
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#user2554330 solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  result
}

#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)

>
Unit: microseconds
      expr        min         lq       mean      median         uq       max neval
 solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16    10
 solution2  55252.890  71492.781  84613.301  84811.7210  93916.544 117451.35    10
 solution3    425.922    517.843   3087.758    589.3145    641.551  25742.11    10

Solution

  • The following provides a heuristic to find an approximate solution. Find the N=20 columns, say, with the most ones, cols, and then use brute force to find every subset of 5 columns out of those 20. The subset having the highest coverage is shown below and its coverage is 93.3%.

    coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
    
    N <- 20
    cols <- tail(as.numeric(names(sort(colSums(d)))), N)
    
    co <- combn(cols, 5)
    itop <- which.max(apply(co, 2, coverage))
    co[, itop]
    ## [1]  90 123 197 199 286
    
    coverage(co[, itop])
    ## [1] 0.9333333
    

    Repeating this for N=5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher the N the better the coverage but the lower the N the less the run time.

    Older solution

    We can approximate the problem with a knapsack problem that chooses the 5 columns with largest numbers of ones using integer linear programming.
    We get the 10 best solutions to this approximate problem, get all columns which are in at least one of the 10 solutions. There are 14 such columns and we then use brute force to find which subset of 5 of the 14 columns has highest coverage.

    library(lpSolve)
    
    ones <- rep(1, 300)
    res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
    
    coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30
    
    # each column of m is logical 300-vector defining possible soln
    m <- matrix(res$solution[1:3000] == 1, 300)
    
    # cols is the set of columns which are in any of the 10 solutions
    cols <- which(rowSums(m) > 0)
    length(cols)
    ## [1] 14
    
    # use brute force to find the 5 best columns among cols
    co <- combn(cols, 5)
    itop <- which.max(apply(co, 2, coverage))
    co[, itop]
    ## [1]  90 123 197 199 286
    coverage(co[, itop])
    ## [1] 0.9333333