
Extract everything before second delimiter in R


Building off this previous post: How to extract string after 2nd delimiter in R

I have a string like the following:

"dat1/set1/set1_covars.csv"

I want to extract everything up to and including the second /, like:

"dat1/set1/"

I was using variations of:

sub("^([^/]+/){2}", "", "dat1/set1/set1_covars.csv")

with ^ and .* moved around to different places, but I just can't seem to get the syntax right.

Any help would be appreciated.


Solution

  • This seems to work:

    sub("(^([^/]+/){2}).*$", "\\1", "dat1/set1/set1_covars.csv")
    
    • wrap the part of the pattern that matches everything up to the second delimiter in (), which makes it a capture group;
    • add .*$ so the match also consumes the rest of the string;
    • replace the empty-string replacement with "\\1", a back-reference to the first capture group.
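
    If you need this for a different number of delimiters, the same pattern can be built with sprintf(). This is only a sketch: before_nth_delim() is a made-up helper name, and it assumes the delimiter is a single character that needs no escaping inside a bracket expression (true for "/").

    ```r
    # Hypothetical helper: keep everything up to and including the n-th delimiter.
    # Assumes `delim` is a single character that is safe inside [^...], e.g. "/".
    before_nth_delim <- function(x, n = 2, delim = "/") {
      # Same shape as the pattern above: "(^([^/]+/){2}).*$"
      pattern <- sprintf("(^([^%s]+%s){%d}).*$", delim, delim, as.integer(n))
      # Strings with fewer than n delimiters don't match and come back unchanged.
      sub(pattern, "\\1", x)
    }

    before_nth_delim("dat1/set1/set1_covars.csv")   # "dat1/set1/"
    before_nth_delim("a/b/c/d.csv", n = 3)          # "a/b/c/"
    ```

    sub() is vectorized, so this also works on a whole character vector of paths.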

    @GregorThomas points out that for this example dirname() would work (minus the trailing slash), but not if the file sits more than two directories deep.
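
    A quick comparison, in case it helps. The deeper path here is made up for illustration:

    ```r
    shallow <- "dat1/set1/set1_covars.csv"
    deep    <- "dat1/set1/sub/set1_covars.csv"   # hypothetical deeper path

    # dirname() strips the last path component, however many levels there are.
    dirname(shallow)   # "dat1/set1"
    dirname(deep)      # "dat1/set1/sub"

    # The regex always stops at the second "/".
    sub("(^([^/]+/){2}).*$", "\\1", deep)   # "dat1/set1/"
    ```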

    Alternatively:

    stringr::str_extract("dat1/set1/set1_covars.csv", "^([^/]+/){2}")
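
    If you'd rather not depend on stringr, roughly the same extraction can be sketched in base R with regexpr() and regmatches(). One difference to be aware of: regmatches() drops elements that don't match, whereas str_extract() returns NA for them.

    ```r
    x <- c("dat1/set1/set1_covars.csv", "no_delims.csv")

    # regexpr() locates the first match; regmatches() pulls out the matched text.
    m <- regexpr("^([^/]+/){2}", x)
    extracted <- regmatches(x, m)
    extracted   # "dat1/set1/" (the non-matching element is dropped)
    ```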
    

    It seemed as though you could also do this with a lookbehind expression, i.e. define a pattern "(?<=^([^/]+/){2}).*$" that says ".*$ preceded by two delimited chunks, but don't count those chunks as part of the match", and replace the match with an empty string. But we run into trouble:

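    • variable-length repetition (the "+" inside the repeated "{2}" group) isn't allowed in lookbehind expressions, and R's default regex engine (without perl = TRUE) doesn't support lookbehind at all
    • even if we spell out the repetition explicitly ("(?<=^[^/]+/[^/]+/).*$") and specify perl = TRUE, it complains that a lookbehind has to be fixed-length, which [^/]+ is not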
    • lookahead/lookbehind always hurts my brain anyway
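
    A possible workaround, which is a different technique from lookbehind: with perl = TRUE, PCRE's \K tells the engine to forget everything matched so far, so only the remainder counts as the match. A minimal sketch:

    ```r
    x <- "dat1/set1/set1_covars.csv"

    # Match two "chunk/" pieces, then \\K resets the start of the reported match,
    # so only the text after the second "/" gets replaced by "".
    res <- sub("^(?:[^/]+/){2}\\K.*$", "", x, perl = TRUE)
    res   # "dat1/set1/"
    ```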