Search code examples
rdataframesplitgrouping

Split a dataframe column containing delimited strings into multiple columns and retain specific portions of the split strings


I have a dataframe df which contains a single column GO. Each row in df contains either one term or multiple terms (separated by ;) and each term has a specific format - it starts with either P, C or F and is followed by a : and then the actual term.

df <- data.frame(
  GO = c("C:mitochondrion; C:kinetoplast", "", "F:calmodulin binding; C:cytoplasm; C:axoneme",
     "", "P:cilium movement; P:inner dynein arm assembly; C:axoneme", "", "F:calcium ion binding"))


                                                         GO
1                            C:mitochondrion; C:kinetoplast
2                                                          
3              F:calmodulin binding; C:cytoplasm; C:axoneme
4                                                          
5 P:cilium movement; P:inner dynein arm assembly; C:axoneme
6                                                          
7                                     F:calcium ion binding

I want to split this column into three columns BP, CC, MF based on whether the terms start with a P, C or an F respectively. Also I want the three columns to have only the terms and not the other identifiers (P, C, F and :).

This is what I want my new dataframe to look like:

                                          BP                         CC                  MF
1                                            mitochondrion; kinetoplast                    
2                                                                                          
3                                                    cytoplasm; axoneme  calmodulin binding
4                                                                                          
5 cilium movement; inner dynein arm assembly                    axoneme                    
6                                                                                          
7                                                                       calcium ion binding

Solution

  • A tidyverse approach to achieve your desired result may look like so:

    library(tidyr)
    library(dplyr)
    
    df %>%
      mutate(id = seq(nrow(.))) %>%
      separate_rows(GO, sep = ";\\s") %>%
      separate(GO, into = c("category", "item"), sep = ":") %>%
      mutate(category = recode(category, C = "CC", P = "BP", F = "MF", .default = "foo")) %>%
      replace_na(list(item = "")) %>%
      group_by(id, category) %>%
      summarise(items = paste(item, collapse = "; "), .groups = "drop") %>%
      pivot_wider(names_from = category, values_from = items, values_fill = "") %>%
      select(BP, CC, MF)
    #> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [3, 7,
    #> 11].
    #> # A tibble: 7 × 3
    #>   BP                                           CC                          MF   
    #>   <chr>                                        <chr>                       <chr>
    #> 1 ""                                           "mitochondrion; kinetoplas… ""   
    #> 2 ""                                           ""                          ""   
    #> 3 ""                                           "cytoplasm; axoneme"        "cal…
    #> 4 ""                                           ""                          ""   
    #> 5 "cilium movement; inner dynein arm assembly" "axoneme"                   ""   
    #> 6 ""                                           ""                          ""   
    #> 7 ""                                           ""                          "cal…