Search code examples
rdataframegroup-summaries

Group by relative difference in a column (accounting for how data is ordered)


I have a data frame. The snippet is:

df1 <- data.frame(x = c(1, 2, 1, 3, 5, 1, 4, 1), y = c(1, 1, 2, 2, 1, 1, 1, 3))
  x y
1 1 1
2 2 1
3 1 2
4 3 2
5 5 1
6 1 1
7 4 1
8 1 3

I need to group df1 by y and sum over x but accounting for order of y. I.e I need to create new group after each new y and sum over corresponding x. The desired output is

   x y
1  3 1
2  4 2
3 10 1
4  1 3

How to do this in R?


Solution

  • Using data.table:

    library(data.table)
    
    setDT(df1)[, .(x=sum(x), y=y[1]), .(rleid(y))][,rleid:=NULL][]
    
    #>     x y
    #> 1:  3 1
    #> 2:  4 2
    #> 3: 10 1
    #> 4:  1 3
    

    Or another dplyr solution using base::rle:

    library(dplyr)
    
    df1 %>% 
      group_by(y_grp = with(rle(y), rep(seq_along(lengths), lengths))) %>% 
      summarise(x = sum(x), y = y[1]) %>% 
      ungroup %>% select(-y_grp)
    
    #> # A tibble: 4 x 2
    #>       x     y
    #>   <dbl> <dbl>
    #> 1     3     1
    #> 2     4     2
    #> 3    10     1
    #> 4     1     3