r, for-loop, count, frequency-distribution, contingency

R: extract sequence of frequencies (contingency tables) from increasing parts of a vector


I have a vector V with n elements, each of which is an integer between 1 and N. Given this vector I'd like to construct an N×n matrix W in which column i contains the frequencies of the integers between 1 and N as they appear in the subvector V[1:i].

For example, suppose N=5 and n=7, and V=c(3,1,4,1,2,1,4). Then my matrix W would have elements

0,1,1,2,2,3,3  
0,0,0,0,1,1,1  
1,1,1,1,1,1,1  
0,0,1,1,1,1,2  
0,0,0,0,0,0,0  

because integer 1 (first row) appears: 0 times in V[1], once in V[1:2], once in V[1:3], twice in V[1:4], twice in V[1:5], three times in V[1:6], three times in V[1:7], etc.

I could do this with a for loop, using table and factor for example:

N <- 5
n <- 7
V <- c(3,1,4,1,2,1,4)
W <- matrix(NA,N,n)

for(i in 1:n){
    W[,i] <- as.vector(table(factor(V[1:i], levels=1:N)))
}

which in fact gives

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    1    1    2    2    3    3
[2,]    0    0    0    0    1    1    1
[3,]    1    1    1    1    1    1    1
[4,]    0    0    1    1    1    1    2
[5,]    0    0    0    0    0    0    0

But I wonder if there's some cleverer, faster way that doesn't use a for loop: my N and n are of the order of 100 or 1000.

Any other insight to improve the code above is also welcome (my knowledge of R is still very basic).

Cheers!


Solution

  • One option with base R is:

    V <- c(3, 1, 4, 1, 2, 1, 4)
    N <- 5
    
    sapply(seq_along(V), 
           function(i) sapply(seq_len(N), function(j) sum(V[seq_len(i)] == j)))
    
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    # [1,]    0    1    1    2    2    3    3
    # [2,]    0    0    0    0    1    1    1
    # [3,]    1    1    1    1    1    1    1
    # [4,]    0    0    1    1    1    1    2
    # [5,]    0    0    0    0    0    0    0
    

    How it works
    seq_along(V): This is a safer equivalent of 1:length(V), i.e. it returns the integer sequence from 1 to the length of your vector V. If you are sure your vector V is non-empty, you can also use 1:length(V) here (or 1:n in your case); for an empty vector, 1:length(V) would give c(1, 0) rather than an empty sequence.

    seq_len(N): Similar to seq_along, but it returns 1:N. If you're sure that N is at least 1, then you can also use 1:N; for N = 0, 1:N would give c(1, 0), whereas seq_len(0) is empty.
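
    The difference between the colon operator and seq_along/seq_len only shows up for zero-length input. A minimal sketch of that edge case, where `empty` is a made-up name used purely for illustration:

    ```r
    empty <- integer(0)            # hypothetical zero-length input

    # The colon operator counts DOWN when the end point is 0
    bad <- 1:length(empty)         # the two-element sequence 1, 0

    # seq_along() and seq_len() return a zero-length sequence instead
    ok_along <- seq_along(empty)
    ok_len   <- seq_len(0)

    print(bad)
    print(length(ok_along))
    print(length(ok_len))
    ```

    A loop or sapply call over `bad` would run twice with invalid indices, while one over `ok_along` would simply not run.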

    sapply: This is a function from the awesome *apply family. It takes a vector or list and applies the specified function to each element. sapply returns a simplified structure, which in our case is a vector for the inner sapply call and a matrix for the complete call.
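
    The simplification rule can be seen on a toy input. This is only a sketch; `as_vector` and `as_matrix` are illustrative names, not part of the answer's code:

    ```r
    # Each call returns a single number, so sapply simplifies to a vector
    as_vector <- sapply(1:3, function(j) j^2)

    # Each call returns a length-2 vector, so sapply simplifies to a matrix
    # with one COLUMN per input element (here 2 rows x 3 columns)
    as_matrix <- sapply(1:3, function(i) c(i, i * 10))

    print(as_vector)
    print(dim(as_matrix))
    ```

    This is why the nested call in the answer produces one column per prefix V[1:i] and one row per value j.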

    sum(V[seq_len(i)] == j): Here we sum over the logical vector that results from comparing each element of the sub-vector V[1:i] with j. By summing over a logical vector, we simply count the number of TRUEs.
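
    The counting trick can be checked on the example data, and it also suggests a possible alternative that avoids rescanning every prefix: cumsum(V == j) gives the running count of j for all prefixes in one pass. The following is a sketch under the assumption that V has more than one element (otherwise sapply would not return a matrix); it has not been benchmarked here.

    ```r
    V <- c(3, 1, 4, 1, 2, 1, 4)
    N <- 5

    # Counting TRUEs: how often does 1 occur in the first four elements?
    count_1 <- sum(V[1:4] == 1)

    # Running counts: for each value j, cumsum(V == j) is the number of
    # occurrences of j in V[1:i] for every i at once.
    # sapply returns an n x N matrix (one column per j), so transpose it.
    # Assumes length(V) > 1.
    W <- t(sapply(seq_len(N), function(j) cumsum(V == j)))

    print(count_1)
    print(W)
    ```

    For the example vector this should reproduce the 5×7 matrix shown above, with each row built from a single comparison over V instead of one comparison per prefix.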