Search code examples
rmatrixmaxconditional-statementsrow

R indexing of matrix used to to determine indices of max values taking over 8 hours in R


I am trying to create an R function which selects an entry at random among those entries that have values equal to the maximum and doing this row wise. The trick is that once i select the column for a given row, i no longer want that column to be considered for selection for subsequent rows. I also want to know how many columns had entries that were equal to the rowwise maximum and exactly what that max value was for the row. I have tried many variations on the theme and here's my code as it stands now. The largemat is a large matrix 5000 rows by 20000 columns. I tried to vectorize this but the issue is that it's a dynamic process so the results for row 2 depend on which column was selected for row 1. So i can't just pick row maxes at once because they could change.

Here's an example of the first two rows:

Row 1: .5, .5, 1, 1 Row 2: .6, .8, .7, .9

So i know that the rowmax for Row 1 is 1 and the row max for row 2 is .9. But if i select the fourth column (from the third and fourth from row 1) then I remove that column from possible selection for row 2 (which now has candidates .6, .8, .7)

I am struggling with how to make this more efficient. Any advice would be appreciated. You all are the masters and I am trying to become one. So any advice is so much appreciated!

Here is my current R code:

function(largemat, reordervector, IDvector)
nrowz<-nrow(largemat)
maxvalues<-numeric(nrowz)
numberofmaxes<-integer(nrowz)
idvalue<-integer(nrowz)

#this line randomizes the order of the rows
tempmat<-largemat[reordervector,]
tempsims<-NULL
for (i in 1:nrowz){
tempsims<-which(tempmat[i,]==max(tempmat[i,]))
numberofmaxes[i]<-length(tempsims) 
tempindx<-ifelse(length(tempsims)==1, tempsims, sample(x=tempsims, size=1))
#pick off the largest value
distvalues[i]<-tempmat[i, tempindx]
# record the column id name of the largest value
idvalue[i]<-IDvector[tempindx]
#remove the column so that it cannot be selected again
tempmat<-tempmat[,-tempindx]

list(nm=numberofmaxes, dv=distvalues, ids=idvalue)
 }

The function will generate three vectors each of length nrow(largemat) producing the number of maxes for each row, the id name for the column position in which the max was found for a given row and the value of the maximum from the original matrix.

Here is a small example:

largemat is a matrix:

largemat<-rbind(c(.2 .5  .6 .8 .9  1  1  1),
                c(.3 .4  .8 .9  1 .7  1  1),
                c(.5  1  .6 .6 .9 .9 .8 .1)) 

Assume this matrix has already permuted the rows (so reordervector has already been applied to largemat)

first step: determine which columns have largest value for row 1: (6, 7, 8) second step: randomly select one of these columns (say 7) third step: grab id values corresponding to id name vector for column 7 (and record the maximum value for row 1 was in fact 1) fourth step: shrink the matrix to eliminate column 7 for further consideration and repeat steps on row 2 of the new matrix:

largemat<-rbind(c(.2 .5  .6 .8 .9  1  1),
                c(.3 .4  .8 .9  1  1  1),
                c(.5  1  .6 .6 .9 .8 .1)) 

continue- the resulting vectors of ids will be something like maxes: 1, etc. ids: col7id, etc. (interpreting columns to column ids) numberof maxes would be: 3, etc. (corresponding to the number of columns for a given row that had the max value for that row)


Solution

  • I would create helper functions to complete the task. Your use of ifelse is problematic in the temp creation. Using if is more appropriate. A data.frame output made the most sense to me:

    choose.max  <- function(x, omit=NULL) {
      x[omit] <- -Inf
      xmax      <- which(x == max(x))
      x_col  <- if(length(xmax) == 1L) xmax else sample(xmax, size=1L)
      x_value   <- max(x)
      num_maxes <- length(xmax)
      return(data.frame(col=x_col, max_value=x_value, num_maxes=num_maxes))
    }
    
    max_choice <- function(df) {
      res <- list(choose.max(df[1,,drop=FALSE]))
    
      for(i in 2:nrow(df)) {
        res[[i]] <- choose.max(x=df[i,,drop=FALSE], omit=sapply(res, '[[', "col"))
      }
    
      return(do.call("rbind", res))
    }
    

    Calling the function max_choice will create the data frame, The first column is for the maximum column selected, then the maximum value of that row, and the number of maxes:

    set.seed(143)
    mat <- matrix(sample(1:5, 16, TRUE), 4, 4)
    max_choice(mat)
    #   col max_value num_maxes
    # 1   1         5         2
    # 2   2         5         1
    # 3   4         5         1
    # 4   3         1         1
    

    Edit

    If speed is important, you can get a boost with this edit:

    max_choice <- function(df) {
      res <- vector("list", nrow(df))
      res[[1]] <- choose.max(df[1,,drop=FALSE])
    
      for(i in 2:nrow(df)) {
        res[[i]] <- choose.max(x=df[i,,drop=FALSE], omit=sapply(res[!sapply(res,is.null)], '[[', "col"))
      }
    
      return(do.call("rbind", res))
    }
    

    Edit 2

    This may even be faster still. parallel is a built-in package for parallel processing:

    library(parallel)
    no_cores <- detectCores() - 1
    cl <- makeCluster(no_cores)
    clusterExport(cl, c("mat", "choose.max", "max_choice"))
    fast_res <- parLapply(cl, 1, function(x) max_choice(mat))[[1]]