I have a list of events with (x,y) coordinates, where x ranges from 1 to 100 and y from 1 to 86. There are (often many) duplicates of each coordinate. I want to populate a matrix (effectively a number grid) with the counts of each coordinate. How do I do this?
Right now, my best attempt is:
s=matrix(data=NA,nrow=n,ncol=k)
for(i in 1:n){
for(j in 1:k){
s[i,j]=nrow(subset(data,x_column==i & y_column==j))
}
}
This works for small (~10,000 rows) data frames, but I'd like to run it for a data frame with nearly 3 million rows, and my method is far too slow.
Edit (data):
n=86;k=100;
x_column y_column
54 30
51 32
65 34
19 46
51 27
45 60
62 31
64 45
16 69
31 33
Thanks guys!
Edit: well, it turns out the program was fast enough for my needs -- my workspace was just bogged down with tons of data, and it was slowing everything I tried to do down. So my method works, but it's good to know alternate ways of populating a matrix. I uploaded the first 10 rows; could someone do a speed test?
Here's one approach, using the data.table and Matrix packages:
library(data.table)
library(Matrix)
f <- function(df, nx, ny) {
## Tally up the frequencies
dt <- data.table(df, key=c("x", "y"))
xyN <- dt[, .N, by=key(dt)]
## Place counts in matrix in their respective i/j x/y row/column
as.matrix(with(xyN, sparseMatrix(i=x,j=y,x=N,dims=c(nx,ny))))
}
## Check that it works:
df <- data.frame(x=c(2,2,2,3,3,3), y=c(1,1,1,1,2,2))
f(df, nx=4, ny=4)
# [,1] [,2] [,3] [,4]
# [1,] 0 0 0 0
# [2,] 3 0 0 0
# [3,] 1 2 0 0
# [4,] 0 0 0 0
## Speed test with 3 million coordinates
df <- data.frame(x=sample(1:100, 3e6,replace=T), y=sample(1:86, 3e6, replace=T))
system.time(res <- f(df, nx=100, ny=86))
# user system elapsed
# 0.16 0.03 0.19
sum(res)
# [1] 3e+06
If you can guarantee that you'll have at least some coordinates in each possible row and column, you can just use base R's table()
(though it isn't nearly as fast):
df <- data.frame(x=sample(1:100, 3e6,replace=T), y=sample(1:86, 3e6, replace=T))
system.time(res2 <- as.matrix(table(df)))
# user system elapsed
# 2.67 0.07 2.74
sum(res2)
# [1] 3000000