I have a vector V with n elements, where each element is an integer between 1 and N. Given this vector, I'd like to construct an N×n matrix W in which column i contains the frequencies of the integers between 1 and N as they appear in the subvector V[1:i].
For example, suppose N=5 and n=7, and V=c(3,1,4,1,2,1,4). Then my matrix W would have elements
0,1,1,2,2,3,3
0,0,0,0,1,1,1
1,1,1,1,1,1,1
0,0,1,1,1,1,2
0,0,0,0,0,0,0
because integer 1 (first row) appears: 0 times in V[1], once in V[1:2], once in V[1:3], twice in V[1:4], twice in V[1:5], three times in V[1:6], three times in V[1:7], etc.
I could do this with a for loop, using table and factor, for example:
N <- 5
n <- 7
V <- c(3,1,4,1,2,1,4)
W <- matrix(NA,N,n)
for (i in 1:n) {
  W[, i] <- as.vector(table(factor(V[1:i], levels = 1:N)))
}
which in fact gives
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0 1 1 2 2 3 3
[2,] 0 0 0 0 1 1 1
[3,] 1 1 1 1 1 1 1
[4,] 0 0 1 1 1 1 2
[5,] 0 0 0 0 0 0 0
But I wonder if there's some cleverer, faster way that doesn't use a for loop: my N and n are of the order of 100 or 1000.
Any other insight to improve the code above is also welcome (my knowledge of R is still very basic).
Cheers!
One option with base R is:
V <- c(3, 1, 4, 1, 2, 1, 4)
N <- 5
sapply(seq_along(V),
function(i) sapply(seq_len(N), function(j) sum(V[seq_len(i)] == j)))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 0 1 1 2 2 3 3
# [2,] 0 0 0 0 1 1 1
# [3,] 1 1 1 1 1 1 1
# [4,] 0 0 1 1 1 1 2
# [5,] 0 0 0 0 0 0 0
How it works
seq_along(V): This is essentially a safer version of 1:length(V), i.e. it returns the sequence from 1 to the length of your vector V. If you are sure your vector V is non-empty, you can also use 1:length(V) here (or 1:n in your case).
seq_len(N): Similar to seq_along, but it returns 1:N. If you're sure that N is at least 1, then you can also use 1:N.
sapply: This is a function from the awesome *apply family. It takes a vector or list and applies the specified function to each element of that vector/list. sapply returns a simplified structure, which in our case is a vector for the inner sapply call and a matrix for the complete call.
sum(V[seq_len(i)] == j): Here we sum over the logical vector that results from comparing each element of the 'sub-vector' V[1:i] with j. Summing over a logical vector simply counts the number of TRUEs.
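Since N and n may be in the hundreds or thousands, it may be worth avoiding the repeated re-counting of V[1:i] for every column. The same "count the TRUEs" idea can be sketched with cumsum, which keeps a running count along V for each integer j. This is only a sketch, not part of the answer above: the helper name cumulative_counts is made up for illustration, and it assumes N is at least 1 and V is non-empty.

```r
# Hypothetical helper (name chosen for illustration only).
# Assumes N >= 1 and length(V) >= 1.
cumulative_counts <- function(V, N) {
  # For each integer j in 1..N, V == j marks the positions where j occurs,
  # and cumsum() turns those marks into a running count along V.
  rows <- lapply(seq_len(N), function(j) cumsum(V == j))
  # Stack the N running-count vectors as the rows of an N x length(V) matrix.
  do.call(rbind, rows)
}

V <- c(3, 1, 4, 1, 2, 1, 4)
N <- 5
W <- cumulative_counts(V, N)
W
```

Each row here is computed in a single pass over V, so the total work grows with N times n rather than with N times n squared as in the nested sapply version.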