Search code examples
rdataframegroup-bygroupingigraph

Regroup elements in data frame based on pair information


Say I have a data frame like the following:

mydf <- data.frame(id=LETTERS, locus=c(rep("alpha",14),rep("beta",12)),
                   group=c(rep(1,2),rep(2,6),rep(3,6),rep(4,4),rep(5,4),rep(6,4)),
                   pair=c(1:12,14,15,3,4,6,8,9:16))

which looks like this:

> mydf
   id locus group pair
1   A alpha     1    1
2   B alpha     1    2
3   C alpha     2    3
4   D alpha     2    4
5   E alpha     2    5
6   F alpha     2    6
7   G alpha     2    7
8   H alpha     2    8
9   I alpha     3    9
10  J alpha     3   10
11  K alpha     3   11
12  L alpha     3   12
13  M alpha     3   14
14  N alpha     3   15
15  O  beta     4    3
16  P  beta     4    4
17  Q  beta     4    6
18  R  beta     4    8
19  S  beta     5    9
20  T  beta     5   10
21  U  beta     5   11
22  V  beta     5   12
23  W  beta     6   13
24  X  beta     6   14
25  Y  beta     6   15
26  Z  beta     6   16

It has an id column as a primary key, and ids are divided into locus alpha and beta. Each id in alpha should have a pair in beta (column pair), though there might be cases of orphan alpha or beta ids (due to prior filtering).

ids are also grouped in different groups (column group), which were defined on a per-locus basis.

Now I want to define another grouping variable new_group that reflects the alpha-beta pairing information.

This seems very straightforward visually, but I am struggling to define the rules in an automatic way, so that every possible situation is taken into account.

The desired output for the example data frame above would be the following (colors just for guidance):

test

Note that I only define a new_group when pair information is found, so ids A and B do not belong to any new_group.

ids C, D, F and H have pair information in beta, so they all constitute new_group 1; since ids E and G belong to the same group, they are pooled together (even if they do not have pair information).

ids from I to N have pair information in beta grouped in different groups, so new_group 2 has to be split in beta into 2a and 2b.

Essentially, rules would be:

  • make a new group when at least one id of a group contains pair info in the other locus
  • for ids without pair info, pool them into the same new group if other ids from their group have paired info
  • groups in alpha can be paired to multiple groups in beta, and conversely groups in beta can be paired to multiple groups in alpha, in which case letter suffixes are used

Solution

  • Here is an igraph solution

    mbs <- mydf %>%
      mutate(sg = str_c("g", group), sp = str_c("p", pair)) %>%
      select(sg, sp) %>%
      graph_from_data_frame(directed = FALSE) %>%
      components() %>%
      membership()
    
    mydf %>%
      mutate(new_group = mbs[startsWith(names(mbs), "g")][str_c("g", group)]) %>%
      mutate(
        new_group = ifelse(n_distinct(locus) == 2, new_group - 1, NA),
        .by = new_group
      ) %>%
      mutate(
        new_group = if (n_distinct(group) == 1) as.character(new_group) else str_c(new_group, "_",match(group, unique(group))),
        .by = c(locus, new_group)
      )
    

    which gives

       id locus group pair new_group
    1   A alpha     1    1      <NA>
    2   B alpha     1    2      <NA>
    3   C alpha     2    3         1
    4   D alpha     2    4         1
    5   E alpha     2    5         1
    6   F alpha     2    6         1
    7   G alpha     2    7         1
    8   H alpha     2    8         1
    9   I alpha     3    9         2
    10  J alpha     3   10         2
    11  K alpha     3   11         2
    12  L alpha     3   12         2
    13  M alpha     3   14         2
    14  N alpha     3   15         2
    15  O  beta     4    3         1
    16  P  beta     4    4         1
    17  Q  beta     4    6         1
    18  R  beta     4    8         1
    19  S  beta     5    9       2_1
    20  T  beta     5   10       2_1
    21  U  beta     5   11       2_1
    22  V  beta     5   12       2_1
    23  W  beta     6   13       2_2
    24  X  beta     6   14       2_2
    25  Y  beta     6   15       2_2
    26  Z  beta     6   16       2_2