regex, r, curly-braces

How do curly braces in R regex work?


I am trying to understand how curly braces work in R regular expressions. The help files say:

{n} The preceding item is matched exactly n times.

{n,} The preceding item is matched n or more times.

{n,m} The preceding item is matched at least n times, but not more than m times.

I have a vector like this:

b <- c("aa", "aaa", "aaaa", "aaaaa")

When I do

b[grep("a{2}", b)]

I would expect it to return only "aa" but instead I get everything. In other words, it yields exactly the same result as

b[grep("a{2,}", b)]

Why?


Solution

  • In the input "aaa", a{2} matches the first two a's, and the same holds for every other element, so grep returns the indices of all the elements. Without anchors, a pattern only has to match somewhere inside the string. To require that the whole string match, add anchors.

    > b <- c("aa", "aaa", "aaaa", "aaaaa")
    > b[grep("^a{2}$", b)]
    [1] "aa"
    

    ^ asserts that we are at the start of the string and $ asserts that we are at the end. So the above grep returns only the index of the element that consists of exactly two a's, i.e., 1.
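
    To see why the unanchored pattern selects every element, it can help to extract the text that a{2} actually matches. The following is a minimal sketch using base R's regexpr() and regmatches(); the variable name first_match is only for illustration.

    ```r
    b <- c("aa", "aaa", "aaaa", "aaaaa")

    # regexpr() locates the first match in each element;
    # regmatches() extracts the matched text.
    first_match <- regmatches(b, regexpr("a{2}", b))
    print(first_match)
    # [1] "aa" "aa" "aa" "aa"

    # grepl() only reports whether a match exists somewhere in the string.
    print(grepl("a{2}", b))
    # [1] TRUE TRUE TRUE TRUE

    # With anchors, the whole string has to be exactly two a's.
    print(grepl("^a{2}$", b))
    # [1]  TRUE FALSE FALSE FALSE
    ```

    Every element contains "aa" as a substring, which is all the unanchored pattern requires.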

    OR

    > b <- c("aa", "aaa", "aaaa", "aaaaa")
    > b[grep("\\ba{2}\\b", b)]
    [1] "aa"
    

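    Adding the \b word boundary also works in this case, because each element is a single word. Unlike the anchors, \b would also match "aa" where it appears as a separate word inside a longer string.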
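
    The two approaches are not interchangeable on other inputs. As a rough sketch, assuming a hypothetical vector x (not from the question) whose last element contains "aa" as one word among others:

    ```r
    # Hypothetical input: the last element contains "aa" as a separate word.
    x <- c("aa", "aaa", "aa bb")

    # Anchors require the entire string to be exactly two a's.
    anchored <- x[grep("^a{2}$", x)]
    print(anchored)   # only "aa"

    # Word boundaries require only that some whole word be exactly two a's.
    bounded <- x[grep("\\ba{2}\\b", x)]
    print(bounded)    # "aa" and "aa bb"
    ```

    Which pattern is appropriate depends on whether the whole string or just one word in it should be exactly "aa".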