Subset string by counting specific characters

I have the following strings:

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

I want to cut off the string, as soon as the number of occurances of A, G and N reach a certain value, say 3. In that case, the result should be:

some_function(strings)

c("ABBSDGN", "AABSDG", "AGN", "GGG")

I tried to use the stringi, stringr and regex expressions but I can't figure it out.

Solution

You can accomplish your task with a simple call to str_extract from the stringr package:

library(stringr)

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")

str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parenthesis and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N, that you are looking for by changing the integer in the curly braces:

str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN"  NA           "AGNA"       "GGGDSRTYHG"

There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:

m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"

Alternatively, you can use sub:

sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG"  "AGN"     "GGG"