I have a matrix where each element is eather 0 or 1. I would like to obtain the frequencies of consecutive occurences of 0's in each row, given the last 0 in the sequence is followed by a 1.
For example:
A row with: 0, 1 , 0, 1, 0, 0
Consecutive 0's of length: 1
Frequency : 2
Another row with: 0, 1, 0, 0, 1, 0, 0, 0, 1
Consecutive 0's of length: 1 2 3
Frequency: 1 1 1
A further objective is then to sum the frequencies of the same length in order to know how many times a single 0 was followed by a 1, two consecutive 0's where followed by a 1 etc.
Here is an exemplary matrix on which I would like to apply the routine:
m = matrix( c(1, 0, 1, 1, 1, 1, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 1, 1, 0, 0),
ncol = 10, nrow = 6, byrow=TRUE)
result = matrix( c(3, 0, 1, 0, 3, 0, 0, 0, 0, 0), ncol=10, nrow=1)
colnames(result) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
Where the column names are the lengths of consecutive 0's (followed by a 1) and the matrix entries the corresponding frequencies.
Note that I have a very large data matrix, and if possible I'd like to avoid loops. Thanks for any hints, comments and propositions.
Using base functions. The complication is you are discarding trailing zeros that do not end with 1.
Explanation in line.
set.seed(13L)
numRows <- 10e4
numCols <- 10
m <- matrix(sample(c(0L, 1L), numRows*numCols, replace=TRUE),
byrow=TRUE, ncol = numCols, nrow = numRows)
#add boundary conditions of all zeros and all ones
m <- rbind(rep(0L, numCols), rep(1L, numCols), m)
#head(m)
rStart <- Sys.time()
lens <- unlist(apply(m, 1, function(x) {
#find the position of the last 1 while handling boundary condition of all zeros
idx <- which(x==1)
endidx <- if (length(idx) == 0) length(x) else max(idx)
beginidx <- if(length(idx)==0) 1 else min(idx)
#tabulate the frequencies of running 0s.
runlen <- rle(x[beginidx:endidx])
list(table(runlen$lengths[runlen$values==0]))
}))
#tabulating results
res <- aggregate(lens, list(names(lens)), FUN=sum)
ans <- setNames(res$x[match(1:ncol(m), res$Group.1)], 1:ncol(m))
ans[is.na(ans)] <- 0
ans
# 1 2 3 4 5 6 7 8 9 10
#100108 43559 18593 7834 3177 1175 387 103 0 106
rEnd <- Sys.time()
print(paste0(round(rEnd - rStart, 2), attr(rEnd - rStart, "units")))
#[1] "27.67secs"
Do let me know the performance after running on the large matrix.