Search code examples
rmatlabhistogramaveragebins

An R-like approach to averaging by histogram bin


as a person transitioning from Matlab, I wish any advice for a more efficient way to find the average of DepDelay values whose indices (indxs) fall within histogram bins (edges). In Matlab and my current R script, I have these commands:

edges       =   seq( min(t), max(t), by = dt )
indxs       =   findInterval( t, edges,all.inside=TRUE )
listIndx    =   sort( unique( indxs ) )
n           =   length( edges )
avgDelay    =   rep( 1, n) * 0
for (i in 1 : n ){
  id = listIndx[i]
  jd = which( id == indxs )
  if ( length(jd) > minFlights){
    avgDelay[id] = mean(DepDelay[jd])
  }
}

I know that using for-loops in R is a potentially fraught issue, but I ask this question in the interests of improved code efficiency.

Sure. A few snippets of the relevant vectors:

DepDelay[1:20] = [1] -4 -4 -4 -9 -6 -7 -1 -7 -6 -7 -7 -5 -8 -3 51 -2 -1 -4 -7 -10

and associated indxs values:

indxs[1:20] = [1] 3 99 195 291 387 483 579 675 771 867 963 1059 1155 1251 1351 1443 1539 1635 1731 1827 

minFlights = 3

Thank you.

BSL


Solution

  • There are many ways to do this in R, all involving variations on the "split-apply-combine" strategy (split the data into groups, apply a function to each group, combine the results by group back into a single data frame).

    Here's one method using the dplyr package. I've created some fake data for illustration, since your data is not in an easily reproducible form:

    library(dplyr) 
    
    # Create fake data
    set.seed(20)
    dat = data.frame(DepDelay = sample(-50:50, 1000, replace=TRUE))
    
    # Bin the data
    dat$bins = cut(dat$DepDelay, seq(-50,50,10), include.lowest=TRUE)
    
    # Summarise by bin
    dat %>% group_by(bins) %>%
      summarise(count = n(),
                meanByBin = mean(DepDelay, na.rm=TRUE))
    
            bins count  meanByBin
    1  [-50,-40]   111 -45.036036
    2  (-40,-30]   110 -34.354545
    3  (-30,-20]    95 -24.242105
    4  (-20,-10]    82 -14.731707
    5    (-10,0]    92  -4.304348
    6     (0,10]   109   5.477064
    7    (10,20]    93  14.731183
    8    (20,30]    93  25.182796
    9    (30,40]   103  35.466019
    10   (40,50]   112  45.696429
    

    data.table is another great package for this kind of task:

    library(data.table)
    
    datDT = data.table(dat)
    setkey(datDT, bins)
    
    datDT[, list(count=length(DepDelay), meanByBin=mean(DepDelay, na.rm=TRUE)), by=bins]
    

    And here are two ways to calculate the mean by bin in base R:

    tapply(dat$DepDelay, dat$bins, mean)
    
    aggregate(DepDelay ~ bins, FUN=mean, data=dat)