R - Last match of regular expression


I am looking for an R pattern-matching expression that extracts the last fully populated taxonomy from each element of the list. The taxonomies always have the same format: one letter, two underscores, and a word (sometimes inside square brackets). Taxonomies that are not fully populated lack the word after the two underscores.

I was able to build an expression that worked on a regular expression builder website:
(.\_\_[A-Za-z\[\]]+)(?!.*__[A-Za-z\[\])
but I had no luck using it, or adapting it, with R's pattern-matching functions such as grep {base} or anything similar. Here is one of the things I tried:

clean=gsub("(.\_\_[A-Za-z[]]+)(?!.*__[A-Za-z[]])","\\1",taxonomies,perl = TRUE)

Any suggestions? Thanks!

taxonomies=
  list('k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__Chitinophagaceae; g__; s__'
       ,'k__Bacteria; p__Actinobacteria; c__MB-A2-108; o__0319-7L14; f__; g__; s__'
       ,'k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales;f__Corynebacteriaceae; g__Corynebacterium; s__'
       ,'k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Rhodocyclales; f__Rhodocyclaceae; g__Methyloversatilis; s__'
       ,'k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Myxococcales; f__; g__; s__'
       ,'k__Bacteria; p__Proteobacteria; c__[Deltaproteobacteria]; o__[W123]; f__[W123]; g__[W123]; s__[W123.012.123]'
       ,'k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__Chitinophagaceae')

Desired output

[1] "f__Chitinophagaceae"  "o__0319-7L14" "g__Corynebacterium"   
[4] "g__Methyloversatilis" "o__Myxococcales"  "s__[W123.012.123]"   
[7] "f__Chitinophagaceae" 

Edit: Included the desired output and the example gsub code that is not working.


Solution

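  • We can use stri_extract_last from stringi, which returns the last match of the pattern in each string. The pattern requires at least one character after the two underscores, so empty ranks such as g__ are skipped.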

    library(stringi)
    stri_extract_last(unlist(taxonomies), regex = '[A-Za-z]__\\[*[[:alnum:].-]+\\]*')
    #[1] "f__Chitinophagaceae"  "o__0319-7L14" "g__Corynebacterium"   
    #[4] "g__Methyloversatilis" "o__Myxococcales"  "s__[W123.012.123]"   
    #[7] "f__Chitinophagaceae" 
    

    Here, I assumed that the OP meant to extract the characters marked with **...** in the question. It was probably a formatting issue, as the text was not rendered in bold.
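If stringi is not available, the same idea can probably be done in base R. The sketch below is only an illustration: it reuses the pattern from the answer, collects every match with gregexpr and regmatches, and keeps the last one per string. The variable names (pattern, all_matches, last_match) and the shortened three-element input are my own choices, not part of the original answer.

```r
# Shortened sample input (three of the strings from the question)
taxonomies <- c(
  "k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__Chitinophagaceae; g__; s__",
  "k__Bacteria; p__Actinobacteria; c__MB-A2-108; o__0319-7L14; f__; g__; s__",
  "k__Bacteria; p__Proteobacteria; c__[Deltaproteobacteria]; o__[W123]; f__[W123]; g__[W123]; s__[W123.012.123]"
)

# Same pattern as in the stringi call: letter, two underscores,
# optional "[", at least one of alnum / "." / "-", optional "]"
pattern <- "[A-Za-z]__\\[*[[:alnum:].-]+\\]*"

# One character vector of matches per input string
all_matches <- regmatches(taxonomies, gregexpr(pattern, taxonomies))

# Keep the last match, or NA when a string has no populated rank
last_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[length(m)] else NA_character_,
  character(1),
  USE.NAMES = FALSE
)

print(last_match)
# [1] "f__Chitinophagaceae" "o__0319-7L14"        "s__[W123.012.123]"
```

Unlike the gsub attempt in the question, this extracts the match instead of replacing it, so nothing else from the input string is left in the result.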

    data

    taxonomies=list(
      'k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__Chitinophagaceae; g__; s__'
      ,'k__Bacteria; p__Actinobacteria; c__MB-A2-108; o__0319-7L14; f__; g__; s__'
      ,'k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales;f__Corynebacteriaceae; g__Corynebacterium; s__'
      ,'k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Rhodocyclales; f__Rhodocyclaceae; g__Methyloversatilis; s__'
      ,'k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Myxococcales; f__; g__; s__'
      ,'k__Bacteria; p__Proteobacteria; c__[Deltaproteobacteria]; o__[W123]; f__[W123]; g__[W123]; s__[W123.012.123]'
      ,'k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__Chitinophagaceae'
      )
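
A note on why the gsub attempt in the question probably did not help: gsub only replaces the part of the string that the pattern matches, and replacing a match with "\\1" (the match itself) leaves the string unchanged. If a replacement-style solution is preferred, one possible sketch is to match the whole string and rely on a greedy leading .* so that the capture group lands on the last populated rank. The names rank_pattern, whole_string and clean, and the two-element input, are assumptions of mine for illustration.

```r
# Two of the strings from the question, as a plain character vector
taxonomies <- c(
  "k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__Chitinophagaceae; g__; s__",
  "k__Bacteria; p__Actinobacteria; c__MB-A2-108; o__0319-7L14; f__; g__; s__"
)

# Pattern for one populated rank, as used in the stringi answer
rank_pattern <- "[A-Za-z]__\\[*[[:alnum:].-]+\\]*"

# Greedy ".*" in front consumes as much as it can, so the group
# captures the last place where a populated rank can still match
whole_string <- paste0("^.*(", rank_pattern, ").*$")

# Replace the entire string with the captured rank
clean <- sub(whole_string, "\\1", taxonomies, perl = TRUE)

print(clean)
# [1] "f__Chitinophagaceae" "o__0319-7L14"
```

One caveat with this approach: a string with no populated rank does not match at all, so sub returns it unchanged rather than NA.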