Tags: r, regex, protein-database

Extract all substrings in string


I want to extract all substrings that begin with M and are terminated by a *.

Take the string below as an example:

vec<-c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")

Ideally, this would return:

MGMTPRLGLESLLE
MTPRLGLESLLE

I have tried the code below:

regmatches(vec, gregexpr('(?<=M).*?(?=\\*)', vec, perl=T))[[1]]

However, this drops the leading M and returns only the first match rather than all substrings within it:

"GMTPRLGLESLLE"

Solution

  • You can use a capturing group inside a positive lookahead:

    (?=(M[^*]*)\*)
    

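    Because the lookahead consumes no characters, the regex engine tries again at the next position after each match, so overlapping matches (such as the one starting at the second M) are found as well. Details:

    • (?= - start of a positive lookahead that matches a location immediately followed by:
    • (M[^*]*) - Group 1: M, then zero or more chars other than a * char
    • \* - a * char
    • ) - end of the lookahead.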
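
To make the difference concrete, here is a minimal base R sketch comparing a pattern that consumes the matched text with the lookahead version. The variable names `consuming` and `positions` are illustrative only and not part of the original answer.

```r
vec <- "SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ"

# Consuming pattern: the first match swallows the inner M,
# so the overlapping candidate is never tried.
consuming <- regmatches(vec, gregexpr("M[^*]*(?=\\*)", vec, perl = TRUE))[[1]]
consuming
## => [1] "MGMTPRLGLESLLE"

# Lookahead pattern: every match is zero-width, so the engine reports
# one match position for each M that is followed by a * later on.
positions <- as.integer(gregexpr("(?=(M[^*]*)\\*)", vec, perl = TRUE)[[1]])
positions
## => [1]  9 11
```

The final M (in MIRVASQ) is not reported because no * follows it.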

    In R, using stringr:

    library(stringr)
    vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
    matches <- stringr::str_match_all(vec, "(?=(M[^*]*)\\*)")
    unlist(lapply(matches, function(z) z[,2]))
    ## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE" 
    

    If you prefer a base R solution (gregexec() requires R 4.1.0 or later):

    vec <- c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ")
    matches <- regmatches(vec, gregexec("(?=(M[^*]*)\\*)", vec, perl=TRUE))
    unlist(lapply(matches, tail, -1))
    ## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
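
If you need this for several sequences at once, or on an R version without gregexec(), one possible sketch reads the capture group positions that gregexpr() records when perl = TRUE. The helper name `extract_m_fragments` is hypothetical, and returning an empty vector for sequences without a match is an assumption about the desired behaviour.

```r
# Hypothetical helper: returns one character vector of fragments per input sequence.
extract_m_fragments <- function(seqs) {
  hits <- gregexpr("(?=(M[^*]*)\\*)", seqs, perl = TRUE)
  lapply(seq_along(seqs), function(i) {
    m <- hits[[i]]
    # gregexpr() signals "no match" with -1
    if (m[1] == -1) return(character(0))
    # Start and length of capture group 1 for each zero-width match
    start <- attr(m, "capture.start")[, 1]
    len   <- attr(m, "capture.length")[, 1]
    substring(seqs[i], start, start + len - 1)
  })
}

res <- extract_m_fragments(c("SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ", "AAAA*"))
res[[1]]
## => [1] "MGMTPRLGLESLLE" "MTPRLGLESLLE"
```

The second element of `res` is empty because that sequence contains no M.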