Search code examples
rdplyrgroup-byaggregate

Collapse specific rows/cases of dataframe


I want to collapse some specific rows of a data.frame (preferably using dplyr in ). Collapsing should aggregate some columns by the functions sum(), others by mean().

As an example, let's add a unique character-based ID to the iris dataset.

iris_df <- iris[1:5,]
iris_df$ID <- paste("ID_",1:nrow(iris_df),sep="")

That's from where we start:

structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5), 
               Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6),
               Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4), 
               Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2),
               Species = structure(c(1L, 1L, 1L, 1L, 1L),
                                   .Label = c("setosa", "versicolor", "virginica"), class = "factor"),
               ID = c("ID_1", "ID_2", "ID_3", "ID_4","ID_5")),
          row.names = c(NA, 5L), class = "data.frame")

Now, I'd like to collapse the cases where ID==ID_1 + ID==ID_2. For that purpose, the Sepal values should be aggregated as means and the Petal values as sums. The ID should become "ID_1+ID_2" (so aggregation by paste()?)

This is how the final result should look like:

structure(list(Sepal.Length = c(5.0, 4.7, 4.6, 5), 
               Sepal.Width = c(3.25, 3.2, 3.1, 3.6),
               Petal.Length = c(2.8, 1.3, 1.5, 1.4), 
               Petal.Width = c(0.4, 0.2, 0.2, 0.2),
               Species = structure(c(1L, 1L, 1L, 1L),
                                   .Label = c("setosa", "versicolor", "virginica"), class = "factor"),
               ID = c("ID_1+ID_2", "ID_3", "ID_4","ID_5")),
          row.names = c(NA, 4L), class = "data.frame")

Can this be done using dplyr (using group_by() and summarize()) package?

Update: As some additional note, the desired procedure should acknowledge that the row index are not known apriori, e.g. just that ID_x and ID_y need to be collapsed (and ID_x might be row i and ID_y at row j).


Solution

  • We could create a grouping based on the presence of those ID using %in%

    library(dplyr)
    library(stringr)
    df1 %>% 
       group_by(grp = case_when(ID %in% c("ID_1", "ID_2") ~ 0L, 
            TRUE ~ row_number()), Species) %>% 
       summarise(across(starts_with("Sepal"), mean), 
       across(starts_with("Petal"), sum), ID = str_c(ID, collapse="+"), 
             .groups = 'drop') %>% 
       select(-grp)
    

    -output

    # A tibble: 4 x 6
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width ID       
      <fct>          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
    1 setosa           5          3.25          2.8         0.4 ID_1+ID_2
    2 setosa           4.7        3.2           1.3         0.2 ID_3     
    3 setosa           4.6        3.1           1.5         0.2 ID_4     
    4 setosa           5          3.6           1.4         0.2 ID_5     
    

    if there is only a single 'Species', then we could also use first

    df1 %>% 
       group_by(grp = case_when(ID %in% c("ID_1", "ID_2") ~ 0L, 
            TRUE ~ row_number())) %>%  
       summarise(across(starts_with("Sepal"), mean), 
       across(starts_with("Petal"), sum), Species = first(Species), 
            ID = str_c(ID, collapse="+"), 
             .groups = 'drop') %>% 
       select(-grp)
    # A tibble: 4 x 6
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID       
             <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>    
    1          5          3.25          2.8         0.4 setosa  ID_1+ID_2
    2          4.7        3.2           1.3         0.2 setosa  ID_3     
    3          4.6        3.1           1.5         0.2 setosa  ID_4     
    4          5          3.6           1.4         0.2 setosa  ID_5     
    

    Or another option is to create a new level by collapsing the IDs or interest in fct_collapse

    library(forcats)
    df1 %>%
       group_by(grp = fct_collapse(ID, other = c("ID_1", "ID_2"))) %>% 
       summarise(across(starts_with("Sepal"), mean), 
       across(starts_with("Petal"), sum), Species = first(Species), 
            ID = str_c(ID, collapse="+"), 
             .groups = 'drop') %>% 
       select(-grp)
    # A tibble: 4 x 6
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID       
             <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>    
    1          5          3.25          2.8         0.4 setosa  ID_1+ID_2
    2          4.7        3.2           1.3         0.2 setosa  ID_3     
    3          4.6        3.1           1.5         0.2 setosa  ID_4     
    4          5          3.6           1.4         0.2 setosa  ID_5