I need to sum the observations that refer to the same individual, without having a unique identification code/row for them.
This is a sample of the dataset:
> head(dataset, 20)
nquest nord tpens
1 173 1 1800
2 633 1 300
3 633 1 600
4 923 1 500
5 2886 1 1211
6 2886 2 2100
7 5416 1 700
8 7886 1 1800
9 7886 1 200
10 20297 1 1200
11 20711 2 2000
12 22169 1 600
13 22169 1 280
14 22173 2 1000
15 22276 1 1200
16 22286 1 850
17 22286 2 650
18 22657 1 1400
19 22657 2 1500
20 23490 1 1400
The variables are:
nquest = the code of the family to which the individual belongs
nord = the position of the individual in the family (1 = husband, 2 = wife, 3 = son, etc.)
tpens = the wage that each of them earns
I need to sum the wage values that refer to the same individual. For example
As you can see, these values of tpens refer to the same individual because not only is nquest (the family code) the same, but so is nord.
I've tried to do it in two ways (following some suggestions):
First way
new_dataset <- dataset %>%
replace(is.na(.), 0) %>%
group_by(nquest, nord) %>%
summarize(tpens = sum(tpens), .groups = 'drop')
Second way
new_dataset <- dataset %>%
replace(is.na(.), 0) %>%
group_by(nquest, nord) %>%
summarize(tpens = sum(tpens), .groups = 'keep') %>%
ungroup()
Are they right?
Can anyone explain to me the difference between computing the sum with .groups = 'keep' and then calling ungroup, versus dropping the groups directly with .groups = 'drop'?
I'm a bit confused because I don't understand this: if I sum the values that correspond to each individual, I should not have groups at the end of the process, just one individual per row (am I wrong?). Yet if I merge this dataset with another one, matching by nquest and nord (hence for each person), I get # A tibble: 6 x 41 # Groups: nquest, nord [6].
How is that possible?
The difference between using .groups = 'keep' and .groups = 'drop' lies in the state of the tibble after summarize returns. If you use .groups = 'keep', the tibble stays grouped until you run ungroup(). If you use .groups = 'drop', the tibble is no longer grouped once summarize has run. To learn more, check out the "Verbs" section of the documentation here.
Take this example:
data("iris")
library(dplyr)
## Let's try "keep"
grouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")
grouped
#> # A tibble: 3 × 2
#> # Groups: Species [3]
#> Species count
#> <fct> <int>
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
grouped %>% group_data()
#> # A tibble: 3 × 2
#> Species .rows
#> <fct> <list<int>>
#> 1 setosa [1]
#> 2 versicolor [1]
#> 3 virginica [1]
## Now, let's try "drop"
ungrouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "drop")
ungrouped
#> # A tibble: 3 × 2
#> Species count
#> <fct> <int>
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
ungrouped %>% group_data()
#> # A tibble: 1 × 1
#> .rows
#> <list<int>>
#> 1 [3]
The key difference in these outputs is the grouping: if we do not call ungroup() or use .groups = 'drop', the output remains grouped. This means that future operations will treat this tibble as grouped, which could create unintended consequences.
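As a rough sketch of what such consequences can look like, the snippet below runs the same mutate on a grouped and an ungrouped copy of the summary, and then joins each copy to a second table. The species_info lookup table and the share column are invented purely for illustration; the join part is only meant to suggest why a merged result can still print a "# Groups:" header.

```r
library(dplyr)

# Same summary as in the example above, left grouped by Species
counts_grouped <- iris %>%
  group_by(Species) %>%
  summarise(count = n(), .groups = "keep")

# Identical rows, but with the grouping removed
counts_ungrouped <- ungroup(counts_grouped)

# Grouped: sum(count) is evaluated inside each one-row group,
# so every share is 50 / 50 = 1
share_grouped <- counts_grouped %>%
  mutate(share = count / sum(count))

# Ungrouped: sum(count) covers all three rows,
# so every share is 50 / 150 = 1/3
share_ungrouped <- counts_ungrouped %>%
  mutate(share = count / sum(count))

# Hypothetical lookup table, invented only to have something to join
species_info <- tibble(
  Species = factor(c("setosa", "versicolor", "virginica")),
  code    = c("A", "B", "C")
)

# A mutating join takes its grouping from the left-hand table
joined_grouped   <- left_join(counts_grouped, species_info, by = "Species")
joined_ungrouped <- left_join(counts_ungrouped, species_info, by = "Species")

group_vars(joined_grouped)    # "Species"
group_vars(joined_ungrouped)  # character(0)
```

If a joined tibble unexpectedly shows groups, checking group_vars() on the left-hand input is usually a quick way to find where the grouping came from.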
If you only need grouping for a single function, try the .by parameter. Learn more here. This way, instead of having to remember to use .groups = 'drop' or ungroup(), you can just write:
iris %>%
summarise(count = n(), .by = Species)
#> # A tibble: 3 × 2
#> Species count
#> <fct> <int>
#> 1 setosa 50
#> 2 versicolor 50
#> 3 virginica 50
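Applied to the data in the question, the same idea might look like the sketch below. It assumes dplyr 1.1.0 or later (where .by is available) and uses only a few rows copied from the head() output in the question. Note that na.rm = TRUE is swapped in here for the replace(is.na(.), 0) step; that is a substitution of mine, not something taken from the original code.

```r
library(dplyr)

# A few rows copied from the head() output in the question
dataset <- tibble(
  nquest = c(633, 633, 2886, 2886, 7886, 7886),
  nord   = c(1, 1, 1, 2, 1, 1),
  tpens  = c(300, 600, 1211, 2100, 1800, 200)
)

# One row per (nquest, nord) pair, i.e. per individual.
# .by groups only for this call, so the result is never grouped.
totals <- dataset %>%
  summarise(tpens = sum(tpens, na.rm = TRUE), .by = c(nquest, nord))

totals
is_grouped_df(totals)  # FALSE
```

With these rows, the two records for family 633, person 1 collapse into a single row with tpens = 900, while the two members of family 2886 stay on separate rows.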
Learn more about grouped data here.