
dplyr & groups: what is the difference between .groups = 'keep' followed by ungroup, and .groups = 'drop'?


I need to sum the observations that refer to the same individual, but the dataset has no unique identification code or row for each individual.

This is a sample of the dataset

> head(dataset, 20)
   nquest nord tpens
1     173    1  1800
2     633    1   300
3     633    1   600
4     923    1   500
5    2886    1  1211
6    2886    2  2100
7    5416    1   700
8    7886    1  1800
9    7886    1   200
10  20297    1  1200
11  20711    2  2000
12  22169    1   600
13  22169    1   280
14  22173    2  1000
15  22276    1  1200
16  22286    1   850
17  22286    2   650
18  22657    1  1400
19  22657    2  1500
20  23490    1  1400

The variables are:

  1. nquest = the code of the family to which the individual belongs
  2. nord = the position of the individual in the family (1 = husband, 2 = wife, 3 = son, etc.)
  3. tpens = the wage that each of them earns

I need to sum the wage values that refer to the same individual. For example:

[Image: Dataset]

As you can see, these values of tpens refer to the same individual, because not only nquest (the family code) is the same, but also nord.

I've tried to do it in two ways (following some suggestions).

First way

new_dataset <- dataset %>%
  replace(is.na(.), 0) %>%
  group_by(nquest, nord) %>% 
  summarize(tpens = sum(tpens), .groups = 'drop')

Second way

new_dataset <- dataset %>%   
  replace(is.na(.), 0) %>%   
  group_by(nquest, nord) %>%    
  summarize(tpens = sum(tpens), .groups = 'keep') %>% 
  ungroup

Are they right? Can anyone explain the difference between computing the sum with .groups = 'keep' and then calling ungroup, versus dropping the groups directly with .groups = 'drop'?

I'm a bit confused because I do not understand this: if I sum the values that correspond to each individual, I should not have groups at the end of the process, just one individual per row (am I wrong?). Yet if I merge this dataset with another one, matching by nquest and nord (hence for each person), I get # A tibble: 6 x 41 # Groups: nquest, nord [6].

How is that possible?


Solution

  • The difference between .groups = 'keep' and .groups = 'drop' lies in the state of the tibble after summarize() returns. With .groups = 'keep', the result keeps the same grouping as the input, and it stays grouped until you run ungroup(). With .groups = 'drop', the result is no longer grouped once summarize() has run. To learn more, check out the "Verbs" section of the documentation here.

    Take this example:

    data("iris")
    library(dplyr)
    
    ## Let's try "keep"
    grouped <- iris %>%
      group_by(Species) %>%
      summarise(count = n(), .groups = "keep")
    grouped
    
    #> # A tibble: 3 × 2
    #> # Groups:   Species [3]
    #>   Species    count
    #>   <fct>      <int>
    #> 1 setosa        50
    #> 2 versicolor    50
    #> 3 virginica     50
    
    grouped %>% group_data()
    #> # A tibble: 3 × 2
    #>   Species          .rows
    #>   <fct>      <list<int>>
    #> 1 setosa             [1]
    #> 2 versicolor         [1]
    #> 3 virginica          [1]
    
    ## Now, let's try "drop"
    ungrouped <- iris %>%
      group_by(Species) %>%
      summarise(count = n(), .groups = "drop")
    ungrouped
    
    #> # A tibble: 3 × 2
    #>   Species    count
    #>   <fct>      <int>
    #> 1 setosa        50
    #> 2 versicolor    50
    #> 3 virginica     50
    
    ungrouped %>% group_data()
    #> # A tibble: 1 × 1
    #>         .rows
    #>   <list<int>>
    #> 1         [3]
    

    The key difference between these outputs is the grouping: unless we call ungroup() or use .groups = 'drop', the output remains grouped. This means that later operations will treat the tibble as grouped and compute per group, which can have unintended consequences.
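
    As a small illustration of such an unintended consequence, the sketch below computes each species' share of the total count. The object names (kept, dropped, share) are only chosen for this example. On the grouped result, sum(count) is evaluated within each group, so every share comes out as 1; on the ungrouped result it is evaluated over the whole table.

    ```r
    library(dplyr)

    # Same summary twice: once leaving the grouping in place, once dropping it
    kept <- iris %>%
      group_by(Species) %>%
      summarise(count = n(), .groups = "keep")

    dropped <- iris %>%
      group_by(Species) %>%
      summarise(count = n(), .groups = "drop")

    # Still grouped by Species: sum(count) is computed per group (50 / 50)
    kept_share <- kept %>%
      mutate(share = count / sum(count))

    # Not grouped: sum(count) is computed over all rows (50 / 150)
    dropped_share <- dropped %>%
      mutate(share = count / sum(count))

    kept_share$share     # 1 1 1
    dropped_share$share  # one third for each species
    ```

    The code is identical in both cases; only the leftover grouping changes the result.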

    If you only need the grouping for a single operation, try the .by argument (available in dplyr 1.1.0 and later). Learn more here. This way, instead of having to remember .groups = 'drop' or ungroup(), you can just write:

    iris %>%
      summarise(count = n(), .by = Species)
    
    #> # A tibble: 3 × 2
    #>   Species    count
    #>   <fct>      <int>
    #> 1 setosa        50
    #> 2 versicolor    50
    #> 3 virginica     50
    

    Learn more about grouped data here.
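
    Applied to the data in the question, a minimal sketch could look like the following. It uses only a few rows from the sample shown above; the second table (other) and its age column are hypothetical, made up purely to illustrate the join. It also uses sum(..., na.rm = TRUE) in place of the replace(is.na(.), 0) step, which is an assumption about how missing wages should be handled. The sketch suggests that both ways from the question give the same values, and that a grouping left on the first table is carried through left_join(), which would explain the "# Groups: nquest, nord" header after the merge.

    ```r
    library(dplyr)

    # A few rows from the sample in the question
    dataset <- tibble(
      nquest = c(173, 633, 633, 2886, 2886),
      nord   = c(1, 1, 1, 1, 2),
      tpens  = c(1800, 300, 600, 1211, 2100)
    )

    # First way: drop the grouping inside summarise()
    way_drop <- dataset %>%
      group_by(nquest, nord) %>%
      summarise(tpens = sum(tpens, na.rm = TRUE), .groups = "drop")

    # Second way: keep the grouping, then remove it explicitly
    way_keep <- dataset %>%
      group_by(nquest, nord) %>%
      summarise(tpens = sum(tpens, na.rm = TRUE), .groups = "keep") %>%
      ungroup()

    # Both give one row per (nquest, nord) with the same sums
    all.equal(as.data.frame(way_drop), as.data.frame(way_keep))  # TRUE

    # If ungroup() is forgotten, the grouping survives a join
    still_grouped <- dataset %>%
      group_by(nquest, nord) %>%
      summarise(tpens = sum(tpens, na.rm = TRUE), .groups = "keep")

    # Hypothetical second table, for illustration only
    other <- tibble(nquest = c(633, 2886), nord = c(1, 2), age = c(70, 68))

    joined <- left_join(still_grouped, other, by = c("nquest", "nord"))
    group_vars(joined)  # "nquest" "nord"

    joined_flat <- left_join(way_drop, other, by = c("nquest", "nord"))
    is_grouped_df(joined_flat)  # FALSE
    ```

    So if the merged tibble still reports groups, the likely cause is that one of the inputs to the merge was still grouped, not that the sums are wrong.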