Search code examples
rhistogramfrequencyecdf

What is the fastest way to obtain frequencies of integers in a vector?


Is there a simple and fast way to obtain the frequency of each integer that occurs in a vector of integers in R?

Here are my attempts so far:

x <- floor(runif(1000000)*1000)

print('*** using TABLE:')
system.time(as.data.frame(table(x)))

print('*** using HIST:')
system.time(hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE))

print('*** using SORT')
system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})

print('*** using ECDF')
system.time({i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])})

print('*** counting in loop')
system.time({h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h})

#print('*** vectorized summation') #This uses too much memory if x is large
#system.time(colSums(matrix(rbind(min(x):max(x))[rep(1,length(x)),]==x,ncol=max(x)-min(x)+1)))

#Note: There are some fail cases in some of the above methods that need patching if, for example, there is a chance that some integer bins are unoccupied

and here are the results:

[1] "*** using TABLE:"
   user  system elapsed 
   1.26    0.03    1.29 
[1] "*** using HIST:"
   user  system elapsed 
   0.11    0.00    0.10 
[1] "*** using SORT"
   user  system elapsed 
   0.22    0.02    0.23 
[1] "*** using ECDF"
   user  system elapsed 
   0.17    0.00    0.17 
[1] "*** counting in loop"
   user  system elapsed 
   3.12    0.00    3.12 

As you can see table is ridiculously slow and hist seems to be the fastest. But hist (as I am using it) is working on arbitrarily-specifiable breakpoints, whereas I simply want to bin integers. Isn't there a way to trade that flexibility for better performance?

In C, for(i=0;i<1000000;i++)h[x[i]]++; would be blisteringly fast.


Solution

  • The fastest is to use tabulate but it requires positive integers as input, so you have to do a quick monotonic transformation.

    set.seed(21)
    x <- as.integer(runif(1e6)*1000)
    system.time({
      adj <- 1L - min(x)
      y <- setNames(tabulate(x+adj), sort(unique(x)))
    })