r dplyr tidyverse data-wrangling

Create new groups based on age span and other categories


How can I divide population into age groups of a certain age-span?

More specifically, I would like to create age groups that cover six ages each: 15-20, 21-26, 27-32, and so on. I also want to keep the categories marriage_status and gender. I've given it a try, but I'm a bit stuck.

# data
pop_clean <- tibble::tribble(
     ~region, ~marriage_status, ~age,   ~gender, ~population, ~year,
     "Riket",         "ogifta",   15,     "män",       56031,  1968,
     "Riket",         "ogifta",   15, "kvinnor",       52959,  1968,
     "Riket",         "ogifta",   16,     "män",       55917,  1968,
     "Riket",         "ogifta",   16, "kvinnor",       52979,  1968,
     "Riket",         "ogifta",   17,     "män",       55922,  1968,
     "Riket",         "ogifta",   17, "kvinnor",       52050,  1968,
     "Riket",         "ogifta",   18,     "män",       58681,  1968,
     "Riket",         "ogifta",   18, "kvinnor",       51862,  1968,
     "Riket",         "ogifta",   19,     "män",       60387,  1968,
     "Riket",         "ogifta",   19, "kvinnor",       49750,  1968,
     "Riket",         "ogifta",   20,     "män",       62487,  1968,
     "Riket",         "ogifta",   20, "kvinnor",       50089,  1968,
     "Riket",         "ogifta",   21,     "män",       60714,  1968,
     "Riket",         "ogifta",   21, "kvinnor",       43413,  1968,
     "Riket",         "ogifta",   22,     "män",       56801,  1968,
     "Riket",         "ogifta",   22, "kvinnor",       36301,  1968,
     "Riket",         "ogifta",   23,     "män",       49862,  1968,
     "Riket",         "ogifta",   23, "kvinnor",       29227,  1968,
     "Riket",         "ogifta",   24,     "män",       42143,  1968,
     "Riket",         "ogifta",   24, "kvinnor",       23155,  1968
     )

# Create groups
pop_clean %>%
  group_by(gender, marriage_status) %>% 
  group_by(grp = cut(age, seq(15, 74, by = 5)))

The output is close to what I want, but it contains some NAs and the groups overlap.

Any help greatly appreciated!

 region marriage_status   age gender  population  year grp    
   <chr>  <chr>           <dbl> <chr>        <dbl> <dbl> <fct>  
 1 Riket  ogifta             15 män          56031  1968 NA     
 2 Riket  ogifta             15 kvinnor      52959  1968 NA     
 3 Riket  ogifta             16 män          55917  1968 (15,20]
 4 Riket  ogifta             16 kvinnor      52979  1968 (15,20]
 5 Riket  ogifta             17 män          55922  1968 (15,20]

Solution

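  • cut() builds right-closed intervals by default, so the lowest break (15) falls outside the first interval (15,20] and becomes NA. Adding include.lowest = TRUE fixes that for the default intervals. If you instead set right = FALSE, each interval is closed on the left, and include.lowest = TRUE then makes the last interval also include the highest break. To follow the intervals in your question (i.e. 15-20, 21-26, 27-32 etc.), I suggest passing labels to cut() and using a step of 6 in the breaks, since each of those groups covers six ages.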

    If you only want to assign every age to an interval, you don't need group_by; mutate is enough.

    library(dplyr)
    
    # Left-closed bins [15,21), [21,27), ... labelled "15-20", "21-26", ...
    pop_clean %>% mutate(grp = cut(age,
                                   breaks = seq(15, 75, by = 6),
                                   labels = paste0(seq(15, 70, by = 6), "-", seq(20, 75, by = 6)),
                                   include.lowest = TRUE,
                                   right = FALSE))
    
    # A tibble: 20 × 7
       region marriage_status   age gender  population  year grp  
       <chr>  <chr>           <dbl> <chr>        <dbl> <dbl> <fct>
     1 Riket  ogifta             15 män          56031  1968 15-20
     2 Riket  ogifta             15 kvinnor      52959  1968 15-20
     3 Riket  ogifta             16 män          55917  1968 15-20
     4 Riket  ogifta             16 kvinnor      52979  1968 15-20
     5 Riket  ogifta             17 män          55922  1968 15-20
     6 Riket  ogifta             17 kvinnor      52050  1968 15-20
     7 Riket  ogifta             18 män          58681  1968 15-20
     8 Riket  ogifta             18 kvinnor      51862  1968 15-20
     9 Riket  ogifta             19 män          60387  1968 15-20
    10 Riket  ogifta             19 kvinnor      49750  1968 15-20
    11 Riket  ogifta             20 män          62487  1968 15-20
    12 Riket  ogifta             20 kvinnor      50089  1968 15-20
    13 Riket  ogifta             21 män          60714  1968 21-26
    14 Riket  ogifta             21 kvinnor      43413  1968 21-26
    15 Riket  ogifta             22 män          56801  1968 21-26
    16 Riket  ogifta             22 kvinnor      36301  1968 21-26
    17 Riket  ogifta             23 män          49862  1968 21-26
    18 Riket  ogifta             23 kvinnor      29227  1968 21-26
    19 Riket  ogifta             24 män          42143  1968 21-26
    20 Riket  ogifta             24 kvinnor      23155  1968 21-26
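
Since the question also asks to keep marriage_status and gender, one possible next step is to sum population within each age group and category. The following is a minimal sketch, assuming that a total per group is what is wanted. The names pop_small and pop_sums are illustrative only, and the rows are a small subset copied from the question's data.

```r
library(dplyr)

# Small stand-in for pop_clean (rows copied from the question's data)
pop_small <- tibble::tribble(
  ~marriage_status, ~age,   ~gender, ~population,
          "ogifta",   15,     "män",       56031,
          "ogifta",   15, "kvinnor",       52959,
          "ogifta",   20,     "män",       62487,
          "ogifta",   20, "kvinnor",       50089,
          "ogifta",   21,     "män",       60714,
          "ogifta",   21, "kvinnor",       43413
)

# Default cut(): right-closed intervals (15,21], (21,27], ...
# Age 15 lies on the open left edge of the first interval and becomes NA.
print(cut(c(15, 20, 21), breaks = seq(15, 75, by = 6)))

# right = FALSE: left-closed intervals [15,21), [21,27), ...
# Ages 15 and 20 now share a group and 21 starts the next one.
pop_sums <- pop_small %>%
  mutate(grp = cut(age,
                   breaks = seq(15, 75, by = 6),
                   labels = paste0(seq(15, 69, by = 6), "-", seq(20, 74, by = 6)),
                   include.lowest = TRUE,
                   right = FALSE)) %>%
  # Keep gender and marriage_status by grouping on them together with grp
  group_by(grp, gender, marriage_status) %>%
  summarise(population = sum(population), .groups = "drop")

print(pop_sums)
```

With dplyr's defaults, age groups that contain no rows are dropped from the summary, so only the groups present in the data appear in pop_sums.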