r group-by data.table aggregate frequency

How to create bin frequency table where bin size varies by group

I'm trying to create a bin frequency table, where there are multiple grouping columns, but, more importantly, bin size varies by one of the grouping columns. Let me illustrate:

set.seed(42)
ID <- as.factor(c(rep("A",20),rep("B",22)))
date <- as.factor(c(rep("C",12),rep("D",8),rep("E",10),rep("F",12)))
group <- as.factor(c(rep("G",6),rep("H",6),rep("G",8),rep("G",6),rep("H",4),rep("G",6),rep("H",6)))
val <- round(rnorm(42,20,10),0)

df <- data.frame(ID,date,group,val)

A frequency table for val by ID, date and group can be generated, using some code I have edited from this post:

br <- c(0,10,30,100)
frqtab <- aggregate(val~ID+date+group,df,FUN=function(x) table(cut(x, br)))

However, I would like to have different bin sizes for each factor within group, e.g. group G could remain with brG <- c(0,10,30,100) and group H could be brH <- c(0,10,50,100). I guess, I could write some ifelse function, but that would be very messy, particularly because my real data has many groups. Any help will be much appreciated!

Solution

Here is a possible solution:

# example data
set.seed(42)
ID <- as.factor(c(rep("A",20),rep("B",22)))
date <- as.factor(c(rep("C",12),rep("D",8),rep("E",10),rep("F",12)))
group <- as.factor(c(rep("G",6),rep("H",6),rep("G",8),rep("G",6),rep("H",4),rep("G",6),rep("H",6)))
val <- round(rnorm(42,20,10),0)

df <- data.frame(ID,date,group,val)

# using the function you provided
f = function(br, df) {aggregate(val~ID+date+group,df,FUN=function(x) table(cut(x, br)))}

library(tidyverse)

# create a look up table
# (specify the breaks for each group)
look_up = data_frame(group_id = c("G","H"),
                     br = list(c(0,10,30,100), c(0,10,50,100)))

df_upd = df %>%
  group_by(group_id = group) %>%          # duplicate group column and group by it
  nest() %>%                              # nest data
  left_join(look_up, by="group_id") %>%   # join look up table to get corresponding breaks
  mutate(d = map2(br, data, ~f(.x, .y)))  # apply function

# see results
df_upd$d

# [[1]]
#   ID date group val.(0,10] val.(10,30] val.(30,100]
# 1  A    C     G          0           5            1
# 2  A    D     G          1           4            1
# 3  B    E     G          1           3            2
# 4  B    F     G          1           5            0
# 
# [[2]]
#   ID date group val.(0,10] val.(10,50] val.(50,100]
# 1  A    C     H          0           6            0
# 2  B    E     H          1           3            0
# 3  B    F     H          0           5            0

I've decided to use the function you provided, which obviously includes the breaks into the column names. For this reason, when you have different breaks for different groups, the output cannot be included in one data frame as there will be a column name conflict.

The only way to get everything in one data frame is if you change your function to produce a more "tidy" output:

library(tidyverse)

# updated function
f = function(br, df) {
  df %>%
  mutate(g = cut(val, br)) %>%
  na.omit() %>%
  count(g, ID, date, group) %>%
  complete(g, nesting(ID, date, group), fill=list(n=0)) }

# same lookup table
look_up = data_frame(group_id = c("G","H"),
                     br = list(c(0,10,30,100), c(0,10,50,100)))

# apply your function
df %>%
  group_by(group_id = group) %>%          
  nest() %>%                              
  left_join(look_up, by="group_id") %>%   
  mutate(d = map2(br, data, ~f(.x, .y))) %>%
  unnest(d) %>%
  select(-group_id) %>%
  arrange(group, date, ID)   # for visualisation purposes only

# # A tibble: 21 x 5
#   g        ID    date  group     n
#   <chr>    <fct> <fct> <fct> <dbl>
# 1 (0,10]   A     C     G         0
# 2 (10,30]  A     C     G         5
# 3 (30,100] A     C     G         1
# 4 (0,10]   A     D     G         1
# 5 (10,30]  A     D     G         4
# 6 (30,100] A     D     G         1
# 7 (0,10]   B     E     G         1
# 8 (10,30]  B     E     G         3
# 9 (30,100] B     E     G         2
# 10 (0,10]  B     F     G         1
# # ... with 11 more rows