Tags: regex, regex-lookarounds, regex-greedy

Find matches ending with a letter that is not a starting letter of the next match


Intro

I have a string containing diagnosis codes (ICD-10), not separated by any character. I would like to extract all valid diagnosis codes. Valid diagnosis codes are of the form

[Letter][between 2 and 4 digits][optional letter that is not the starting letter of the next match]

The regex for this pattern is (I believe)

\w\d{2,4}\w?

Example

Here is an example

mystring='F328AG560F33'

In this example there are three codes:

'F328A' 'G560' 'F33'

I would like to extract these codes with a function like str_extract_all in R (preferably, but not exclusively).

My solution so far

So far, I managed to come up with an expression like:

str_extract_all(mystring,pattern='\\w\\d{2,4}\\w?(?!(\\w\\d{2,4}\\w?))')

However, when applied to the example above, it returns

"F328"  "G560F"

It misses the letter A in the first code, and it misses the last code, "F33", altogether because its F is mistakenly assigned to the preceding code.
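The result is not specific to R. Below is a minimal sketch in Python's re module that reproduces it, assuming that \w, \d and the lookahead behave the same way as in stringr for this ASCII-only input:

```python
import re

mystring = "F328AG560F33"

# The attempted pattern, unchanged apart from Python raw-string escaping.
attempt = re.compile(r"\w\d{2,4}\w?(?!(\w\d{2,4}\w?))")

# finditer + group(0) yields the whole matches; re.findall would return
# the capture group inside the lookahead rather than the whole match.
found = [m.group(0) for m in attempt.finditer(mystring)]
print(found)  # ['F328', 'G560F']
```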

Question

What am I doing wrong? I only want to extract values that end with a letter that is not the start of the next match, and if it is, the match should not include the letter.

Application

This question is relevant, for example, when mining patient Electronic Health Records that have not been validated.


Solution

  • Your matches are overlapping. In this case, you might use str_match_all, which gives easy access to capturing groups, together with a pattern that places a capturing group inside a positive lookahead:

    (?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))
    

    See the regex demo

    Details

    • (?= - a positive lookahead start (it is tried at every position: before each character and at the end of the string)
    • ( - Group 1 start
      • [A-Z] - a letter (with the case-insensitive modifier (?i), a letter of either case)
      • \d{2,4} - 2 to 4 digits
      • (?: - an optional non-capturing group start:
        • [A-Z] - a letter
        • (?!\d{2,4}) - not followed by 2 to 4 digits
      • )? - the optional non-capturing group end
    • ) - Group 1 end
    • ) - Lookahead end.
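
The optional trailing-letter group described above can be checked in isolation. This is an illustrative sketch in Python rather than R. It uses the inner pattern without the outer lookahead wrapper, and the variable names are made up for the example:

```python
import re

# Core of the pattern, without the outer (?=( ... )) wrapper.
core = re.compile(r"(?i)[A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?")

# "A" is followed by "G5", not by two or more digits, so it stays in the code.
kept = core.match("F328AG560").group(0)

# "F" is followed by "33", so the optional group is given up and "F" is left
# for the next code.
dropped = core.match("G560F33").group(0)

print(kept, dropped)  # F328A G560
```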

    R demo:

    > library(stringr)
    > res <- str_match_all("F328AG560F33", "(?i)(?=([A-Z]\\d{2,4}(?:[A-Z](?!\\d{2,4}))?))")
    > res[[1]][,2]
    [1] "F328A" "G560"  "F33"
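
Since the question also allows tools other than R, here is a rough Python equivalent of the demo above. It assumes Python's re module handles this pattern the same way as stringr does, and extract_codes is a hypothetical helper name used only for illustration:

```python
import re

# Same pattern as in the R demo, written as a Python raw string.
pattern = re.compile(r"(?i)(?=([A-Z]\d{2,4}(?:[A-Z](?!\d{2,4}))?))")

def extract_codes(text):  # hypothetical helper name
    # With a single capturing group, findall returns the Group 1 text
    # for every position where the lookahead succeeds.
    return pattern.findall(text)

print(extract_codes("F328AG560F33"))  # ['F328A', 'G560', 'F33']
```

Because the lookahead itself consumes no characters, the regex engine moves forward one position at a time and reports Group 1 wherever a code starts.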