Search code examples
raggregatebinning

Restructuring values in unequal buckets by aggregating on a column in R


I have a dataset which looks like the one below:

| Id |  Name | Date_diff |
|----|:-----:|----------:|
| 50 | David |  0        |
| 50 | David | -16       |
| 50 | David | -4        |
| 50 | David | -1        |
| 50 | David |  0        |
| 50 | David | -2        |
| 84 | Ron   | -11       |
| 84 | Ron   | -12       |
| 84 | Ron   | -168      |
| 84 | Ron   | -8        |
| 84 | Ron   | 16        |
| 84 | Ron   | NA        |

Reproducible code is:

df= data.frame(Id= c('50','84'), Name= c('David','Ron'))
df=df[rep(seq_len(nrow(df)),each=6),]
Date_diff= c(0,-16,-4,-1,0,-2,-11,-12,-168,-8,16,'NA')
df=data.frame(df,Date_diff)

Now, for each Id, I need to create different columns of unequal buckets which will have the count of values in the column 'Date-diff'. The bucket ranges need to be 'NA', '>0','0','-1','-2 to -3', '-4 to -6','-7 to -12' and '>-12'. There will also be an additional column 'total' which will hold the summed values present in the buckets.

For example, when we consider Id=50, we see that there are 2 counts for the value '0' which would fall in the bucket '0', 1 count for the value '-16' which would fall in the bucket '>0', 1 count for the value -4 which would fall in the range '-4 to -6' and so on. The final table should be as represented below:

| Id |  Name | NA | >0 | 0 | -1 | -2 to -3 | -4 to -6 | -7 to -12 | >-12 | Total |
|----|:-----:|---:|----|---|----|----------|----------|-----------|------|-------|
| 50 | David |  0 | 0  | 2 | 1  | 1        | 1        | 0         | 1    | 6     |
| 84 |  Ron  |  1 | 1  | 0 | 0  | 0        | 0        | 3         | 1    | 6     |

I initially tried to create a new column to categorize the values in 'Date_diff' in it, but the values being provided in breaks is probably wrong. This is what I have tried:

df <- transform(df, group=cut(Date_diff,  breaks=c(-Inf,-13,-7,-4,-2,-1,Inf),
                               labels=c('<-12', '-7 to -12','-4 to -6','-2 to -3', '-1','>0')))

Can someone please let me know how to achieve the desired result?


Solution

  • One of the issues was having 'NA' as a character string instead of NA. Here is a solution with:

    df <- data.frame(
      id = c('50', '84'),  
      name = c('david', 'ron'),
      date_diff = c(0, -16, -4, -1, 0, -2, -11, -12, -168, -8, 16,  na)
    )
    
    library(dplyr)
    library(tidyr)
    
    df %>%
      mutate(
        group = cut(
          Date_diff,
          breaks = c(-Inf,-13,-7,-4,-2,-1,Inf),
          labels = c('<-12', '-7 to -12','-4 to -6','-2 to -3', '-1','>0')
        ),
        group = if_else(is.na(group), "NA", as.character(group))
      ) %>%
      group_by(Id, Name, group) %>%
      summarise(n = n()) %>%
      mutate(Total = sum(n, na.rm = T)) %>%
      pivot_wider(names_from = group, values_from = n)