Search code examples
rdataframe

Split data frame by group based on the occurance of an id


I'm trying to find a solution in base R that can split a data.frame in groups, based on values in a column (group), that are also dependent by the occurrence of values in another column (id).

For example:

I have a data.frame

df = data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,17,18,18,19,19,20,20), 
                group = c("A","A","A","A","A","A","A","A","B","B","B","B","C","C","C","C","A","B","A","B","A","B","A","B"),
                num = c(0.1,0.1,0.1,0.1,0.2,0.2,0.2,0.2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1),
                value = c(sample(10:90, 16, replace = TRUE), rep(c(21,67,80,69), times=1, each=2)))

> df
   id group num value
1   1     A 0.1    59
2   2     A 0.1    72
3   3     A 0.1    82
4   4     A 0.1    17
5   5     A 0.2    39
6   6     A 0.2    46
7   7     A 0.2    39
8   8     A 0.2    56
9   9     B 0.1    31
10 10     B 0.1    46
11 11     B 0.1    63
12 12     B 0.1    15
13 13     C 0.1    51
14 14     C 0.1    68
15 15     C 0.1    48
16 16     C 0.1    28
17 17     A 0.1    21
18 17     B 0.1    21
19 18     A 0.1    67
20 18     B 0.1    67
21 19     A 0.1    80
22 19     B 0.1    80
23 20     A 0.1    69
24 20     B 0.1    69

and I'm trying to split the data.frame based on the following groups:

> df
$`A`
   id group num value
1   1     A 0.1    59
2   2     A 0.1    72
3   3     A 0.1    82
4   4     A 0.1    17
5   5     A 0.2    39
6   6     A 0.2    46
7   7     A 0.2    39
8   8     A 0.2    56

$`B`
   id group num value
9   9     B 0.1    31
10 10     B 0.1    46
11 11     B 0.1    63
12 12     B 0.1    15

$`C`
   id group num value
13 13     C 0.1    51
14 14     C 0.1    68
15 15     C 0.1    48
16 16     C 0.1    28

$`AB`
   id group num value
17 17     A 0.1    21
18 17     B 0.1    21
19 18     A 0.1    67
20 18     B 0.1    67
21 19     A 0.1    80
22 19     B 0.1    80
23 20     A 0.1    69
24 20     B 0.1    69

The last group is identified by pairs based on the column id. Meaning, if the id comes in pairs (17,18,19,20), the group is considered a separate group (AB) compared to the group (A and B) where the id dose not come in pairs (1:12).

How can this be accomplished with base R? Can this be done using the function split()?


Solution

  • You can first use ave to generate the groups you wanted, then split base on it.

    To have the last group named as A + B, you'll need to collapse the ave results with paste with + (or obviously, sub the previous toString results). To have it being split at the end, the most convenient way I can think of is to use fct_inorder in the forcats package, otherwise, use the excellent codes shared by @Friede if you want to stay in base R.

    split(df, forcats::fct_inorder(ave(df$group, df$id, FUN = \(x) paste(x, collapse = " + "))))
    
    $A
      id group num value
    1  1     A 0.1    78
    2  2     A 0.1    66
    3  3     A 0.1    18
    4  4     A 0.1    81
    5  5     A 0.2    35
    6  6     A 0.2    16
    7  7     A 0.2    51
    8  8     A 0.2    18
    
    $B
       id group num value
    9   9     B 0.1    45
    10 10     B 0.1    87
    11 11     B 0.1    90
    12 12     B 0.1    52
    
    $C
       id group num value
    13 13     C 0.1    85
    14 14     C 0.1    24
    15 15     C 0.1    41
    16 16     C 0.1    16
    
    $`A + B`
       id group num value
    17 17     A 0.1    21
    18 17     B 0.1    21
    19 18     A 0.1    67
    20 18     B 0.1    67
    21 19     A 0.1    80
    22 19     B 0.1    80
    23 20     A 0.1    69
    24 20     B 0.1    69