Search code examples
rmatrixfrequency

r Counting occurences of specific number between specific number


I have a matrix where each element is eather 0 or 1. I would like to obtain the frequencies of consecutive occurences of 0's in each row, given the last 0 in the sequence is followed by a 1.

For example:

A row with: 0, 1 , 0, 1, 0, 0

The expected result should be:

Consecutive 0's of length: 1

Frequency : 2

Another row with: 0, 1, 0, 0, 1, 0, 0, 0, 1

The expected result:

Consecutive 0's of length: 1 2 3

Frequency: 1 1 1

A further objective is then to sum the frequencies of the same length in order to know how many times a single 0 was followed by a 1, two consecutive 0's where followed by a 1 etc.

Here is an exemplary matrix on which I would like to apply the routine:

m = matrix( c(1, 0, 1, 1, 1, 1, 0, 0, 0,  0,
      1, 1, 1, 1, 0, 1, 0, 0, 0,  0,
      1, 0, 0, 0, 1, 1, 1, 0, 0,  0,
      0,  1, 0, 0, 0, 0, 0, 1, 1, 1,
      1, 1, 1, 0, 0, 0, 0, 0, 1,  0,
      1, 0, 0, 0, 0, 0, 1, 1, 0,  0),

      ncol = 10, nrow = 6, byrow=TRUE)

The expected result should then be like the matrix below:

result = matrix( c(3, 0, 1, 0, 3, 0, 0, 0, 0, 0), ncol=10, nrow=1)
colnames(result) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")

Where the column names are the lengths of consecutive 0's (followed by a 1) and the matrix entries the corresponding frequencies.

Note that I have a very large data matrix, and if possible I'd like to avoid loops. Thanks for any hints, comments and propositions.


Solution

  • Using base functions. The complication is you are discarding trailing zeros that do not end with 1.

    Explanation in line.

    set.seed(13L)
    numRows <-  10e4
    numCols <- 10
    m <- matrix(sample(c(0L, 1L), numRows*numCols, replace=TRUE),
        byrow=TRUE, ncol = numCols, nrow = numRows)
    #add boundary conditions of all zeros and all ones
    m <- rbind(rep(0L, numCols), rep(1L, numCols), m)
    #head(m)
    
    rStart <- Sys.time()
    lens <- unlist(apply(m, 1, function(x) {
        #find the position of the last 1 while handling boundary condition of all zeros
        idx <- which(x==1)
        endidx <- if (length(idx) == 0) length(x) else max(idx)
        beginidx <- if(length(idx)==0) 1 else min(idx)
    
        #tabulate the frequencies of running 0s.
        runlen <- rle(x[beginidx:endidx])
        list(table(runlen$lengths[runlen$values==0]))
    }))
    
    #tabulating results
    res <- aggregate(lens, list(names(lens)), FUN=sum)
    ans <- setNames(res$x[match(1:ncol(m), res$Group.1)], 1:ncol(m))
    ans[is.na(ans)] <- 0
    ans
    #     1      2      3      4      5      6      7      8      9     10 
    #100108  43559  18593   7834   3177   1175    387    103      0    106 
    
    rEnd <- Sys.time()
    print(paste0(round(rEnd - rStart, 2), attr(rEnd - rStart, "units")))
    #[1] "27.67secs"
    

    Do let me know the performance after running on the large matrix.