Tags: r, regex, dplyr, stringr, mutate

How do I fix a 'mutate()' error in simple 'stringr' regex code?


I am trying to sort through a Trinotate RNA-seq output file in R to extract all gene ontology terms (e.g., GO:00057) associated with gene IDs (e.g., TRINITY_DN142883_c0_g1) from a long string of mixed alphabetic and numeric characters. I'm using 'dplyr' and 'stringr'.

My data looks like this:

Gene GOX
TRINITY_DN142883_c0_g1 GO:0003779^molecular_function^actin binding`GO:0045886^biological_process^negative regulation of synaptic assembly at neuromuscular junction`GO:0016567^biological_process^protein ubiquitination
TRINITY_DN142917_c0_g1 GO:0016020^cellular_component^membrane`GO:0000325^cellular_component^plant-type vacuole`GO:0005886^cellular_component^plasma membrane`GO:0005774^cellular_component^vacuolar membrane`GO:0005773^cellular_component^vacuole`GO:0004016^molecular_function^adenylate cyclase activity`GO:0015079^molecular_function^potassium ion transmembrane transporter activity`GO:0006813^biological_process^potassium ion transport

My code is below (I'm trying to remove all words from the 'GOX' column so that only the GO terms remain):

practice%>%
  select(GOX)%>%
  mutate(col=str_replace_all(practice,"\\w",""))

Depending on the size of my data frame (2 vs 10 rows), I get different errors:

Error 1: Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `col = str_replace_all(practice, "\\w", "")`.
Caused by warning in `stri_replace_all_regex()`:
! argument is not an atomic vector; coercing 

OR Error 2:

Error in `mutate()`:
ℹ In argument: `col = str_replace_all(practice, "\\w", "")`.
Caused by error:
! `col` must be size 10 or 1, not 2.
Run `rlang::last_trace()` to see where the error occurred.

I've looked over Stack Overflow, and there seem to be similar errors, but the fixes don't seem appropriate (e.g., this data doesn't have missing values, and I believe the data.frame format is appropriate for str_replace_all as long as it is used with 'dplyr'). Maybe this is caused by all of the symbols in my 'GOX' column? I'd appreciate any help, and I'm open to completely different techniques, as I'm pretty inexperienced with regex.

Thanks!


Solution

  • I'd take the opposite approach and keep only those terms that match your pattern. str_extract_all followed by unnest also has the advantage of putting your data in tidy format, which will most likely help with downstream processing.

    The select(-GOX) is there only to assist in formatting the result nicely.

    Assume your data frame is df.

    library(tidyverse)
    
    df %>% 
      mutate(GOTerm = str_extract_all(GOX, "(GO:\\d+)")) %>% 
      unnest(GOTerm) %>% 
      select(-GOX)
    # A tibble: 11 × 2
       Gene                   GOTerm   
       <chr>                  <chr>     
     1 TRINITY_DN142883_c0_g1 GO:0003779
     2 TRINITY_DN142883_c0_g1 GO:0045886
     3 TRINITY_DN142883_c0_g1 GO:0016567
     4 TRINITY_DN142917_c0_g1 GO:0016020
     5 TRINITY_DN142917_c0_g1 GO:0000325
     6 TRINITY_DN142917_c0_g1 GO:0005886
     7 TRINITY_DN142917_c0_g1 GO:0005774
     8 TRINITY_DN142917_c0_g1 GO:0005773
     9 TRINITY_DN142917_c0_g1 GO:0004016
    10 TRINITY_DN142917_c0_g1 GO:0015079
    11 TRINITY_DN142917_c0_g1 GO:0006813
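
A follow-up note on why the original call failed, offered as a sketch and not a definitive diagnosis: `str_replace_all(practice, ...)` receives the whole data frame `practice` instead of the `GOX` column, which would explain the "not an atomic vector" warning. A data frame coerces to one string per column, so a two-column `practice` would give a length-2 result. That fits a 2-row frame by coincidence and would fail with "must be size 10 or 1, not 2" on a 10-row frame. The example below uses a small made-up stand-in for `practice` (shortened GO strings, backtick separator assumed), and the names `stripped`, `go_long`, and `go_wide` are hypothetical.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical two-row stand-in for the asker's `practice` data frame
# (shortened strings; the backtick between terms is an assumption).
practice <- tibble(
  Gene = c("TRINITY_DN142883_c0_g1", "TRINITY_DN142917_c0_g1"),
  GOX = c(
    "GO:0003779^molecular_function^actin binding`GO:0016567^biological_process^protein ubiquitination",
    "GO:0016020^cellular_component^membrane"
  )
)

# Referring to the column GOX (not the data frame) gives one result per row,
# so mutate() accepts it. However, "\\w" matches letters, digits and
# underscores, so this also deletes the GO numbers themselves.
stripped <- practice %>%
  mutate(col = str_replace_all(GOX, "\\w", ""))

# Extracting the pattern instead keeps only the GO identifiers.
# Long format: one row per gene/term pair.
go_long <- practice %>%
  mutate(GOTerm = str_extract_all(GOX, "GO:\\d+")) %>%
  select(-GOX) %>%
  unnest(GOTerm)

# Variant that keeps one row per gene, with terms joined by ";".
go_wide <- practice %>%
  mutate(GOTerms = vapply(
    str_extract_all(GOX, "GO:\\d+"),
    paste, character(1),
    collapse = ";"
  )) %>%
  select(-GOX)

print(go_long)
print(go_wide)
```

The `stripped` column shows that fixing only the column reference still does not meet the stated goal, since only punctuation and spaces are left; that is the reason for matching `GO:\\d+` and extracting it instead of deleting everything else.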