Search code examples
rcoordinate

R: Need to populate a matrix with counts of coordinates (from data frame with list of coordinates, some of which are duplicates)


I have a list of events with (x,y) coordinates, where x ranges from 1 to 100 and y from 1 to 86. There are (often many) duplicates of each coordinate. I want to populate a matrix (effectively a number grid) with the counts of each coordinate. How do I do this?

Right now, my best attempt is:

s=matrix(data=NA,nrow=n,ncol=k)
for(i in 1:n){
  for(j in 1:k){
    s[i,j]=nrow(subset(data,x_column==i & y_column==j))
  }
}

This works for small (~10,000 rows) data frames, but I'd like to run it for a data frame with nearly 3 million rows, and my method is far too slow.

Edit (data):

n=86;k=100;
x_column y_column
54          30
51          32
65          34
19          46
51          27
45          60
62          31
64          45
16          69
31          33

Thanks guys!

Edit: well, it turns out the program was fast enough for my needs -- my workspace was just bogged down with tons of data, and it was slowing everything I tried to do down. So my method works, but it's good to know alternate ways of populating a matrix. I uploaded the first 10 rows; could someone do a speed test?


Solution

  • Here's one approach, using the data.table and Matrix packages:

    library(data.table)
    library(Matrix)
    
    f <- function(df, nx, ny)  {
        ## Tally up the frequencies
        dt <- data.table(df, key=c("x", "y"))
        xyN <- dt[, .N, by=key(dt)]
        ## Place counts in matrix in their respective i/j x/y row/column
        as.matrix(with(xyN, sparseMatrix(i=x,j=y,x=N,dims=c(nx,ny))))
    }
    
    ## Check that it works:
    df <- data.frame(x=c(2,2,2,3,3,3), y=c(1,1,1,1,2,2))
    f(df, nx=4, ny=4)
    #      [,1] [,2] [,3] [,4]
    # [1,]    0    0    0    0
    # [2,]    3    0    0    0
    # [3,]    1    2    0    0
    # [4,]    0    0    0    0
    
    ## Speed test with 3 million coordinates
    df <- data.frame(x=sample(1:100, 3e6,replace=T), y=sample(1:86, 3e6, replace=T))
    system.time(res <- f(df, nx=100, ny=86))
    #    user  system elapsed 
    #    0.16    0.03    0.19 
    sum(res)
    # [1] 3e+06
    

    If you can guarantee that you'll have at least some coordinates in each possible row and column, you can just use base R's table() (though it isn't nearly as fast):

    df <- data.frame(x=sample(1:100, 3e6,replace=T), y=sample(1:86, 3e6, replace=T))
    system.time(res2 <- as.matrix(table(df)))
    #    user  system elapsed 
    #    2.67    0.07    2.74 
    sum(res2)
    # [1] 3000000