Search code examples
rdataframebinning

Better way of binning data in a group in a data frame by equal intervals


I have a dataframe of which is characterized by many different ID's. For every ID there are multiple events which are characterized by the cumulative time duration between events(hours) and the duration of that event(seconds). So, it would look something like:

Id <- c(1,1,1,1,1,1,2,2,2,2,2)
cumulative_time<-c(0,3.58,8.88,11.19,21.86,29.54,0,5,14,19,23)
duration<-c(188,124,706,53,669,1506.2,335,349,395,385,175)
test = data.frame(Id,cumulative_time,duration)

> test
   Id cummulative_time duration
1   1             0.00    188.0
2   1             3.58    124.0
3   1             8.88    706.0
4   1            11.19     53.0
5   1            21.86    669.0
6   1            29.54   1506.2
7   2             0.00    335.0
8   2             5.00    349.0
9   2            14.00    395.0
10  2            19.00    385.0
11  2            23.00    175.0

I would like to group by the ID and then restructure the group by sampling by a cumulative amount of every say 10 hours, and in that 10 hours sum by the duration that occurred in the 10 hour interval. The number of bins I want should be from say 0 to 30 hours. Thus were would be 3 bins.

I looked at the cut function and managed to make a hack of it within a dataframe - even me as a new r user I know it isn't pretty

test_cut = test %>% 
  mutate(bin_durations = cut(test$cummulative_time,breaks = c(0,10,20,30),labels = c("10","20","30"),include.lowest = TRUE)) %>% 
  group_by(Id,bin_durations) %>% 
  mutate(total_duration = sum(duration)) %>% 
  select(Id,bin_durations,total_duration) %>% 
  distinct()

which gives the output:

test_cut 
  Id time_bins duration
1  1        10   1018.0
2  1        20     53.0
3  1        30   2175.2
4  2        10    684.0
5  2        20    780.0
6  2        30    175.0

Ultimately I want the interval window and number of bins to be arbitrary - If I have a span of 5000 hours and I want to bin in 1 hour samples. For this I would use breaks=seq(0,5000,1) for the bins I would say labels = as.character(seq(1,5000,1))

This is will also be applied to a very large data frame, so computational speed somewhat desired.

A dplyr solution would be great since I am applying the binning per group.

My guess is there is a nice interaction between cut and perhaps split to generate the desired output.

Thanks in advance.

Update

After testing, I find that even my current implementation isn't quite what I'd like as if I say:

n=3
test_cut = test %>% 
  mutate(bin_durations = cut(test$cumulative_time,breaks=seq(0,30,n),labels = as.character(seq(n,30,n)),include.lowest = TRUE)) %>% 
  group_by(Id,bin_durations) %>% 
  mutate(total_duration = sum(duration)) %>% 
  select(Id,bin_durations,total_duration) %>% 
  distinct()

I get

test_cut
# A tibble: 11 x 3
# Groups:   Id, bin_durations [11]
      Id bin_durations total_duration
   <dbl> <fct>                  <dbl>
 1     1 3                       188 
 2     1 6                       124 
 3     1 9                       706 
 4     1 12                       53 
 5     1 24                      669 
 6     1 30                     1506.
 7     2 3                       335 
 8     2 6                       349 
 9     2 15                      395 
10     2 21                      385 
11     2 24                      175 

Where there are no occurrences in the bin sequence I should just get 0 in the duration column. Rather than an omission.

Thus, it should look like:

test_cut
# A tibble: 11 x 3
# Groups:   Id, bin_durations [11]
      Id bin_durations total_duration
   <dbl> <fct>                  <dbl>
 1     1 3                       188 
 2     1 6                       124 
 3     1 9                       706 
 4     1 12                       53 
 5     1 15                        0 
 6     1 18                        0
 7     1 21                        0    
 8     1 24                      669
 9     1 27                        0 
10     1 30                     1506.
11     2 3                       335 
12     2 6                       349
13     2 9                         0
14     2 12                        0  
15     2 15                      395
16     2 18                        0 
17     2 21                      385 
18     2 24                      175
19     2 27                        0
20     2 30                        0 

Solution

  • Here is one idea via integer division (%/%)

    library(tidyverse)
    
    test %>% 
     group_by(Id, grp = cumulative_time %/% 10) %>% 
     summarise(toatal_duration = sum(duration))
    

    which gives,

    # A tibble: 6 x 3
    # Groups:   Id [?]
         Id   grp toatal_duration
      <dbl> <dbl>           <dbl>
    1     1     0           1018 
    2     1     1             53 
    3     1     2           2175.
    4     2     0            684 
    5     2     1            780 
    6     2     2            175 
    

    To address your updated issue, we can use complete in order to add the missing rows. So, for the same example, binning in hours of 3,

    test %>%
         group_by(Id, grp = cumulative_time %/% 3) %>%
         summarise(toatal_duration = sum(duration)) %>%
         ungroup() %>%
         complete(Id, grp = seq(min(grp), max(grp)), fill = list(toatal_duration = 0))
    

    which gives,

         # A tibble: 20 x 3
          Id   grp toatal_duration
       <dbl> <dbl>           <dbl>
     1     1     0            188 
     2     1     1            124 
     3     1     2            706 
     4     1     3             53 
     5     1     4              0 
     6     1     5              0 
     7     1     6              0 
     8     1     7            669 
     9     1     8              0 
    10     1     9           1506.
    11     2     0            335 
    12     2     1            349 
    13     2     2              0 
    14     2     3              0 
    15     2     4            395 
    16     2     5              0 
    17     2     6            385 
    18     2     7            175 
    19     2     8              0 
    20     2     9              0