Search code examples
rdataframetidyr

Separate collapsed column into multiple rows, preserving grouping information


I have a messy table I read into a data frame similar to the following simplified one, which I want to parse.

mydf <- data.frame(group=LETTERS[1:5], code=101:105, ids=c('g1:id1,id2,id3\ng2:id4,id5',
                                                           'id6,id7,id8,id9',
                                                           'g1:id10,id11\ng3:id12',
                                                           'g2:id13,id14',
                                                           'id15'))

It looks like:

> mydf
  group code                        ids
1     A  101 g1:id1,id2,id3\ng2:id4,id5
2     B  102            id6,id7,id8,id9
3     C  103      g1:id10,id11\ng3:id12
4     D  104               g2:id13,id14
5     E  105                       id15

I need to separate the collapsed ids column into multiple rows, to obtain a final data frame into long format.

For this, I would usually apply separate_rows from tidyr, but this case has an additional problem: some of the collapsed ids are grouped into subgroups.

I guess I should first separate those subgroups into different ids columns, and then apply separate_rows or separate_longer_delim on those columns... I do not know how to go about this.

The final data frame I expect would be like this, in long format, with the subgroup information in an additional column:

> mydf_new
   group code   id subgroup
1      A  101  id1       g1
2      A  101  id2       g1
3      A  101  id3       g1
4      A  101  id4       g2
5      A  101  id5       g2
6      B  102  id6     <NA>
7      B  102  id7     <NA>
8      B  102  id8     <NA>
9      B  102  id9     <NA>
10     C  103 id10       g1
11     C  103 id11       g1
12     C  103 id12       g3
13     D  104 id13       g2
14     D  104 id14       g2
15     E  105 id15     <NA>

Solution

  • One way is to separate wider / longer across a few steps:

    library(tidyr)
    library(dplyr)
    
    mydf |> 
      separate_longer_delim(ids, "\n") |> 
      separate_wider_delim(ids, ":", names = c("subgroup", "id"), too_few = "align_end") |> 
      separate_longer_delim(id, ",") |> 
      relocate(subgroup, .after = last_col())
    
    # A tibble: 15 × 4
       group  code id    subgroup
       <chr> <int> <chr> <chr>   
     1 A       101 id1   g1      
     2 A       101 id2   g1      
     3 A       101 id3   g1      
     4 A       101 id4   g2      
     5 A       101 id5   g2      
     6 B       102 id6   NA      
     7 B       102 id7   NA      
     8 B       102 id8   NA      
     9 B       102 id9   NA      
    10 C       103 id10  g1      
    11 C       103 id11  g1      
    12 C       103 id12  g3      
    13 D       104 id13  g2      
    14 D       104 id14  g2      
    15 E       105 id15  NA