Search code examples
rdata-cleaning

How to sum rows based on exact conditions on multiple columns and save edited rows in original dataset?


There are 3 parts to this problem:

1) I want to sum values in column b,c,d for any two adjacent rows which have the same values for columns(b,c,d)

2) I would like to keep values in other columns the same. (Some other column (eg. a) may contain character data.)

3) I would like to keep the changes by replacing the original value in columns b,c,d in the first row (of the 2 same rows) with the new values (the sums) and delete the second row(of the 2 same rows).

   Time         a    b     c     d    id 
1  2014/10/11   A   40    20    10     1
2  2014/10/12   A   40    20    10     2
3  2014/10/13   B    9    10     9     3
4  2014/10/14   D   16     5    12     4
5  2014/10/15   D    1     6     5     5
6  2014/10/16   B   20     7     8     6
7  2014/10/17   B   20     7     8     7
8  2014/10/18   A   11     9     5     8
9  2014/10/19   C   31    20    23     9

Expected outcome:

    Time         a    b     c     d    id 
 1  2014/10/11   A   80    40    20     1 *
 3  2014/10/13   B    9    10     9     3
 4  2014/10/14   D   16     5    12     4
 5  2014/10/15   D    1     6     5     5
 6  2014/10/16   B   40    14    16     6 *
 8  2014/10/18   A   11     9     5     8
 9  2014/10/19   C   31    20    23     9

id 1 and 2 combined to become id 1; id 6 and 7 combined to become id 6.

Thank you. Any contribution is greatly appreciated.


Solution

  • Using dplyr functions along with data.table::rleid. To get same values for adjacent b, c and d columns we paste them and use rleid to create groups. For each group we sum the values at b, c and d columns and keep only the 1st row.

    library(dplyr)
    
    df %>%
      mutate(temp_col = paste(b, c, d, sep = "-")) %>%
      group_by(group = data.table::rleid(temp_col)) %>%
      mutate_at(vars(b, c, d), sum) %>%
      slice(1L) %>%
      ungroup %>%
      select(-temp_col, -group)
    
    #  Time       a         b     c     d    id
    #  <fct>      <fct> <int> <int> <int> <int>
    #1 2014/10/11 A        80    40    20     1
    #2 2014/10/13 B         9    10     9     3
    #3 2014/10/14 D        16     5    12     4
    #4 2014/10/15 D         1     6     5     5
    #5 2014/10/16 B        40    14    16     6
    #6 2014/10/18 A        11     9     5     8
    #7 2014/10/19 C        31    20    23     9
    

    data

    df <- structure(list(Time = structure(1:9, .Label = c("2014/10/11", 
    "2014/10/12", "2014/10/13", "2014/10/14", "2014/10/15", "2014/10/16", 
    "2014/10/17", "2014/10/18", "2014/10/19"), class = "factor"), 
    a = structure(c(1L, 1L, 2L, 4L, 4L, 2L, 2L, 1L, 3L), .Label = c("A", 
    "B", "C", "D"), class = "factor"), b = c(40L, 40L, 9L, 16L, 
    1L, 20L, 20L, 11L, 31L), c = c(20L, 20L, 10L, 5L, 6L, 7L, 
    7L, 9L, 20L), d = c(10L, 10L, 9L, 12L, 5L, 8L, 8L, 5L, 23L
    ), id = 1:9), class = "data.frame", row.names = c("1", "2", 
    "3", "4", "5", "6", "7", "8", "9"))