I have a dataframe df
which contains a single column GO
. Each row in df
contains either one term or multiple terms (separated by ;
) and each term has a specific format - it starts with either P, C or F and is followed by a :
and then the actual term.
df <- data.frame(
GO = c("C:mitochondrion; C:kinetoplast", "", "F:calmodulin binding; C:cytoplasm; C:axoneme",
"", "P:cilium movement; P:inner dynein arm assembly; C:axoneme", "", "F:calcium ion binding"))
GO
1 C:mitochondrion; C:kinetoplast
2
3 F:calmodulin binding; C:cytoplasm; C:axoneme
4
5 P:cilium movement; P:inner dynein arm assembly; C:axoneme
6
7 F:calcium ion binding
I want to split this column into three columns BP
, CC
, MF
based on whether the terms start with a P
, C
or an F
respectively. Also I want the three columns to have only the terms and not the other identifiers (P, C, F and :
).
This is what I want my new dataframe to look like:
BP CC MF
1 mitochondrion; kinetoplast
2
3 cytoplasm; axoneme calmodulin binding
4
5 cilium movement; inner dynein arm assembly axoneme
6
7 calcium ion binding
A tidyverse
approach to achieve your desired result may look like so:
library(tidyr)
library(dplyr)
df %>%
mutate(id = seq(nrow(.))) %>%
separate_rows(GO, sep = ";\\s") %>%
separate(GO, into = c("category", "item"), sep = ":") %>%
mutate(category = recode(category, C = "CC", P = "BP", F = "MF", .default = "foo")) %>%
replace_na(list(item = "")) %>%
group_by(id, category) %>%
summarise(items = paste(item, collapse = "; "), .groups = "drop") %>%
pivot_wider(names_from = category, values_from = items, values_fill = "") %>%
select(BP, CC, MF)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [3, 7,
#> 11].
#> # A tibble: 7 × 3
#> BP CC MF
#> <chr> <chr> <chr>
#> 1 "" "mitochondrion; kinetoplas… ""
#> 2 "" "" ""
#> 3 "" "cytoplasm; axoneme" "cal…
#> 4 "" "" ""
#> 5 "cilium movement; inner dynein arm assembly" "axoneme" ""
#> 6 "" "" ""
#> 7 "" "" "cal…