Is it possible to make groups based on an ID of a person in R?


I have this data:

data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
                   birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))

I want to group the persons into families in a new column, so that persons 27101 and 27102 (siblings) are group/family 1, 42101 and 42102 are group 2, 117103, 117104 and 117105 are group 3, and so on. Person "4102" has no siblings and should be NA in the new column. Two or more persons are always siblings if their IDs are no more than 6 apart. My real dataset is far larger, with over 3000 rows. What is the most efficient way to do this?


Solution

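  • You can use round() with digits = -1 to round each id_pers to the nearest ten (5 is subtracted first, so that IDs ending in 1 to 9 within one family all round to the same value). Use digits = -2 instead if a family can have more than 10 members. If you want the family ID to be consecutive integers starting from 1, you can use cur_group_id():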

    library(dplyr)
    data %>% 
      group_by(fam_id = round(id_pers - 5, digits = -1)) %>% 
      mutate(fam_gp = cur_group_id())
    

    output

    # A tibble: 15 × 4
    # Groups:   fam_id [10]
       id_pers birthyear fam_id fam_gp
         <dbl>     <dbl>  <dbl>  <int>
     1    4102      1992   4100      1
     2   13102      1994  13100      2
     3   27101      1993  27100      3
     4   27102      1992  27100      3
     5   28101      1995  28100      4
     6   28102      1999  28100      4
     7   42101      2000  42100      5
     8   42102      2001  42100      5
     9   56102      2000  56100      6
    10   73102      1994  73100      7
    11   74103      1999  74100      8
    12  103104      1978 103100      9
    13  117103      1986 117100     10
    14  117104      1998 117100     10
    15  117105      1999 117100     10
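
The output above still numbers single-person families, whereas the question asks for NA when a person has no siblings. A minimal sketch of one way to get that, assuming the rule from the question is read as "a new family starts whenever the gap to the previous ID is larger than 6" (IDs are chained, so the data must be sorted by id_pers). The helper columns fam_raw and fam_size are names made up for this illustration:

```r
library(dplyr)

data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102,
              56102, 73102, 74103, 103104, 117103, 117104, 117105),
  birthyear = c(1992, 1994, 1993, 1992, 1995, 1999, 2000, 2001,
                2000, 1994, 1999, 1978, 1986, 1998, 1999)
)

result <- data %>%
  arrange(id_pers) %>%
  # Assumption: a gap of more than 6 to the previous ID starts a new family
  mutate(fam_raw = cumsum(c(TRUE, diff(id_pers) > 6))) %>%
  # add_count() attaches the family size to every row without collapsing rows
  add_count(fam_raw, name = "fam_size") %>%
  # Families with 2+ members are numbered 1, 2, ... in order of appearance;
  # match() returns NA for singletons because they are not in the lookup vector
  mutate(fam_gp = match(fam_raw, unique(fam_raw[fam_size > 1L]))) %>%
  select(-fam_raw, -fam_size)

result
```

With the sample data as posted, 28101 and 28102 are also only 1 apart, so they form a family of their own and 42101/42102 end up in family 3 rather than 2. Everything here is vectorised, so a few thousand rows should not be a problem.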