r regex pcre

regular expression to match up to first instance of repeated character


My example data:

l1
[1] "xmms-1.2.11-x86_64-5"     "xmms-1.2.11-x86_64-6"    
[3] "xmodmap-1.0.10-x86_64-1"  "xmodmap-1.0.9-x86_64-1"  
[5] "xmodmap3-1.0.10-x86_64-1" "xmodmap3-1.0.9-x86_64-1"

I am using R and would like a regular expression that will capture just the characters before the first dash. Such as

xmms
xmms
xmodmap
xmodmap
xmodmap3
xmodmap3

Since I am using R, the regex needs to be Perl compliant.

I thought I could do this by using a lookbehind on the dash, but the pattern I tried, grepl("(?<=[a-z0-9])-",l1, perl=T), just matches the whole string. I think if I had the first dash as a capture group, I could maybe use the lookbehind, but I don't know how to build the regex with the lookbehind and the capture group.

I looked around at some other questions for possible answers, and it seems maybe I need a non-greedy symbol? I tried grepl("(?<=[a-z0-9])-/.+?(?=-)/",l1, perl=T), but that didn't work either.

I'm open to other suggestions on how to capture the first set of characters before the dash. I'm currently in base R, but I'm fine with using any packages, like stringr.


Solution

  • 1) Base R: One option is sub from base R, which matches the first - and everything after it (-.*) and replaces the match with an empty string ("")

    sub("-.*", "", l1)
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    Or capture as a group

    sub("(\\w+).*", "\\1", l1)
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    Or with regmatches/regexpr

    regmatches(l1, regexpr('\\w+', l1))
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    Or using trimws (the whitespace argument requires R 3.6.0 or later)

    trimws(l1,  "right", whitespace = "-.*")
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    Or using read.table

    read.table(text = l1, sep="-", header = FALSE, stringsAsFactors = FALSE)$V1
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

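    Or with strsplit, keeping the first piece of each split

    sapply(strsplit(l1, "-"), `[`, 1)
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"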
    

    2) stringr: Or with word from stringr

    library(stringr)
    word(l1, 1, sep="-")
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    Or with str_remove

    str_remove(l1, "-.*")
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    3) stringi: Or with stri_extract_first from stringi

    library(stringi)
    stri_extract_first(l1, regex = "\\w+")
    #[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"
    

    Note: grep/grepl only detect whether a pattern occurs in a string. To replace or extract a substring in base R, use sub, or regexpr together with regmatches.
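
To make the note concrete, here is a minimal sketch of what grepl returns for the lookbehind pattern from the question, next to one way the lookaround idea could be made to extract text with regexpr/regmatches. The pattern ^[^-]+(?=-) is an illustrative assumption, not something from the question, and it presumes every string contains at least one dash. The sample vector is a shortened copy of the data.

```r
l1 <- c("xmms-1.2.11-x86_64-5", "xmodmap-1.0.10-x86_64-1", "xmodmap3-1.0.9-x86_64-1")

# grepl only reports whether the pattern occurs somewhere in each string
detected <- grepl("(?<=[a-z0-9])-", l1, perl = TRUE)
detected
#[1] TRUE TRUE TRUE

# regexpr locates the first match in each string; regmatches extracts it.
# ^[^-]+ takes the run of non-dash characters at the start, and the
# lookahead (?=-) requires a dash right after that run without consuming it.
first_part <- regmatches(l1, regexpr("^[^-]+(?=-)", l1, perl = TRUE))
first_part
#[1] "xmms"     "xmodmap"  "xmodmap3"
```

The lookahead is optional here: ^[^-]+ alone already stops at the first dash, so it mainly serves to show how a lookaround fits into an extraction call.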

    Data

    l1 <- c("xmms-1.2.11-x86_64-5", "xmms-1.2.11-x86_64-6", "xmodmap-1.0.10-x86_64-1", 
    "xmodmap-1.0.9-x86_64-1", "xmodmap3-1.0.10-x86_64-1", "xmodmap3-1.0.9-x86_64-1"
    )
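
The \\w+ patterns and the -.* patterns above agree on this data because every name consists only of letters and digits. A small sketch of where they would diverge, using a hypothetical element "lib2.foo-1.0-x86_64-1" that is not part of the original data: \w does not match a dot, so the first run of word characters ends before the dash does.

```r
# Hypothetical input: the second name contains a dot, which \w does not match
x <- c("xmms-1.2.11-x86_64-5", "lib2.foo-1.0-x86_64-1")

# Remove everything from the first dash onward
by_dash <- sub("-.*", "", x)
by_dash
#[1] "xmms"     "lib2.foo"

# Take the first run of word characters (letters, digits, underscore)
by_word <- regmatches(x, regexpr("\\w+", x))
by_word
#[1] "xmms" "lib2"

# A negated character class ties the match to the dash explicitly
by_class <- regmatches(x, regexpr("^[^-]+", x))
by_class
#[1] "xmms"     "lib2.foo"
```

If the names may contain characters other than letters, digits, and underscores, the dash-based variants (sub with -.*, strsplit, or ^[^-]+) are probably the safer choice.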