Search code examples
rgroup-bydata.tablecut

Bin data within a group using breaks from another DF


How to avoid using the for loop in the following code to speed up the computation (the real data is about 1e6 times larger)

id = rep(1:5, 20)
v = 1:100
df = data.frame(groupid = id, value = v)
df = dplyr::arrange(df, groupid)

bkt = rep(seq(0, 100, length.out = 4), 5)
id = rep(1:5, each = 4)
bktpts = data.frame(groupid = id, value = bkt)

for (i in 1:5) {
  df[df$groupid == i, "bin"] = cut(df[df$groupid == i, "value"],
                                   bktpts[bktpts$groupid == i, "value"],
                                   include.lowest = TRUE, labels = F)
}

Solution

  • I came out with another data.table answer:

    library(data.table) # load package
    
    # set to data.table
    setDT(df) 
    setDT(bktpts)
    
    # Make a join
    df[bktpts[, list(.(value)), by = groupid], bks := V1, on = "groupid"]
    
    # define the bins:
    df[, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE), by = groupid]
    
    # remove the unneeded bks column
    df[, bks := NULL]
    

    Explaining the code:

    bktpts[, list(.(value)), by = groupid] is a new table that has in a list al the values of value for each groupid. If you run it alone, you'll understand where we're going.

    bks := V1 assigns to variable bks in df whatever exists in V1, which is the name of the list column in the previous table. Of course on = "groupid" is the variable on which we make the join.

    The code defining the bins needs little explanation, except by the bks[[1]] bit. It needs to be [[ in order to access the list values and provide a vector, as required by the cut function.

    EDIT TO ADD:

    All data.table commands can be chained in a -rather unintelligible- single call:

    df[bktpts[, list(.(value)), by = groupid], 
       bks := V1, 
       on = "groupid"][, 
                    bin := cut(value, 
                               bks[[1]], 
                               include.lowest = TRUE, 
                               labels = FALSE), 
                    by = groupid][, 
                                 bks := NULL]