Search code examples
rgroup-bydata.tableaggregatefrequency

How to create bin frequency table where bin size varies by group


I'm trying to create a bin frequency table, where there are multiple grouping columns, but, more importantly, bin size varies by one of the grouping columns. Let me illustrate:

set.seed(42)
ID <- as.factor(c(rep("A",20),rep("B",22)))
date <- as.factor(c(rep("C",12),rep("D",8),rep("E",10),rep("F",12)))
group <- as.factor(c(rep("G",6),rep("H",6),rep("G",8),rep("G",6),rep("H",4),rep("G",6),rep("H",6)))
val <- round(rnorm(42,20,10),0)

df <- data.frame(ID,date,group,val)

A frequency table for val by ID, date and group can be generated, using some code I have edited from this post:

br <- c(0,10,30,100)
frqtab <- aggregate(val~ID+date+group,df,FUN=function(x) table(cut(x, br)))

However, I would like to have different bin sizes for each factor within group, e.g. group G could remain with brG <- c(0,10,30,100) and group H could be brH <- c(0,10,50,100). I guess, I could write some ifelse function, but that would be very messy, particularly because my real data has many groups. Any help will be much appreciated!


Solution

  • Here is a possible solution:

    # example data
    set.seed(42)
    ID <- as.factor(c(rep("A",20),rep("B",22)))
    date <- as.factor(c(rep("C",12),rep("D",8),rep("E",10),rep("F",12)))
    group <- as.factor(c(rep("G",6),rep("H",6),rep("G",8),rep("G",6),rep("H",4),rep("G",6),rep("H",6)))
    val <- round(rnorm(42,20,10),0)
    
    df <- data.frame(ID,date,group,val)
    
    # using the function you provided
    f = function(br, df) {aggregate(val~ID+date+group,df,FUN=function(x) table(cut(x, br)))}
    
    library(tidyverse)
    
    # create a look up table
    # (specify the breaks for each group)
    look_up = data_frame(group_id = c("G","H"),
                         br = list(c(0,10,30,100), c(0,10,50,100)))
    
    df_upd = df %>%
      group_by(group_id = group) %>%          # duplicate group column and group by it
      nest() %>%                              # nest data
      left_join(look_up, by="group_id") %>%   # join look up table to get corresponding breaks
      mutate(d = map2(br, data, ~f(.x, .y)))  # apply function
    
    # see results
    df_upd$d
    
    # [[1]]
    #   ID date group val.(0,10] val.(10,30] val.(30,100]
    # 1  A    C     G          0           5            1
    # 2  A    D     G          1           4            1
    # 3  B    E     G          1           3            2
    # 4  B    F     G          1           5            0
    # 
    # [[2]]
    #   ID date group val.(0,10] val.(10,50] val.(50,100]
    # 1  A    C     H          0           6            0
    # 2  B    E     H          1           3            0
    # 3  B    F     H          0           5            0
    

    I've decided to use the function you provided, which obviously includes the breaks into the column names. For this reason, when you have different breaks for different groups, the output cannot be included in one data frame as there will be a column name conflict.

    The only way to get everything in one data frame is if you change your function to produce a more "tidy" output:

    library(tidyverse)
    
    # updated function
    f = function(br, df) {
      df %>%
      mutate(g = cut(val, br)) %>%
      na.omit() %>%
      count(g, ID, date, group) %>%
      complete(g, nesting(ID, date, group), fill=list(n=0)) }
    
    # same lookup table
    look_up = data_frame(group_id = c("G","H"),
                         br = list(c(0,10,30,100), c(0,10,50,100)))
    
    # apply your function
    df %>%
      group_by(group_id = group) %>%          
      nest() %>%                              
      left_join(look_up, by="group_id") %>%   
      mutate(d = map2(br, data, ~f(.x, .y))) %>%
      unnest(d) %>%
      select(-group_id) %>%
      arrange(group, date, ID)   # for visualisation purposes only
    
    # # A tibble: 21 x 5
    #   g        ID    date  group     n
    #   <chr>    <fct> <fct> <fct> <dbl>
    # 1 (0,10]   A     C     G         0
    # 2 (10,30]  A     C     G         5
    # 3 (30,100] A     C     G         1
    # 4 (0,10]   A     D     G         1
    # 5 (10,30]  A     D     G         4
    # 6 (30,100] A     D     G         1
    # 7 (0,10]   B     E     G         1
    # 8 (10,30]  B     E     G         3
    # 9 (30,100] B     E     G         2
    # 10 (0,10]  B     F     G         1
    # # ... with 11 more rows