regex, pcre

Match the first occurrence of an optional pattern


I am trying to extract names in messy strings like the following:

genus species subsp. name […] x name […] var. name; genus2 species2 subsp. name2 var. name2  
genus species subsp. name […] x name […] var. name  
genus species subsp. name […] var name  
genus species subsp. name var. name  
genus species subsp. name

Where […] can be a succession of any characters with no regular patterns.

The desired output is:

subsp. name x name var. name  
subsp. name x name var. name  
subsp. name var. name  
subsp. name var. name  
subsp. name

My regex looks like this:

(?i).*?\b((?:aff|cf|ssp|subsp|var)[\.\s]+)([a-z-]+).*?(\sx\s+[a-z-]+)?.*?(\svar[\.\s]+[a-z-]+)?.*

Here is a demo.

I'm using the lazy quantifier *? to find the first occurrence of each anchor (e.g. subsp, x and var) in the strings, so that I can match the name that follows it. The problem is that I can't get the regex to work for all instances, because (\sx\s+[a-z-]+)? and (\svar[\.\s]+[a-z-]+)? are optional: the patterns they match don't exist in all the strings.

Is there a simple solution to get around this issue?


Solution

  • You can wrap each optional pattern, together with the .*? that precedes it, in an optional non-capturing group. The capturing group inside is then obligatory, which forces the regex engine to make at least one real attempt to search for the pattern.

    That means you need to change every .*?(pattern-to-extract)? to (?:.*?(pattern-to-extract))?. In the original form, the lazy .*? first matches nothing, the optional capturing group fails at that position and is skipped, and the engine considers the job done. When .*? and the capturing group are wrapped together in an optional group, the group is tried at least once, and .*? is expanded as many times as necessary to reach the capturing group pattern. The whole group is skipped only if the pattern is not found anywhere in the rest of the string.

    Use

    (?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)(?:.*?(\sx\s+[a-z-]+))?(?:.*?(\svar[.\s]+[a-z-]+))?.*
    

    Note that a dot inside a character class matches a literal dot, so there is no need to escape it.

    See the regex demo.
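
To see the difference between the two patterns, here is a minimal sketch in Python. The `re` module is not PCRE, but it handles this particular pattern the same way. The sample strings, the filler text standing in for […], and the helper name `extract` are made up for illustration and are not from the question.

```python
import re

# Pattern from the question: each optional group follows a bare lazy .*?
ORIGINAL = re.compile(
    r"(?i).*?\b((?:aff|cf|ssp|subsp|var)[\.\s]+)([a-z-]+)"
    r".*?(\sx\s+[a-z-]+)?.*?(\svar[\.\s]+[a-z-]+)?.*"
)

# Pattern from the answer: .*? and the capturing group share one optional group
FIXED = re.compile(
    r"(?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)"
    r"(?:.*?(\sx\s+[a-z-]+))?(?:.*?(\svar[.\s]+[a-z-]+))?.*"
)


def extract(pattern, text):
    """Join the captured groups, skipping optional groups that did not match."""
    m = pattern.match(text)
    if m is None:
        return None
    return "".join(g for g in m.groups() if g is not None)


# Hypothetical inputs; "(L.) Auth." and "Smith" play the role of [...]
samples = [
    "genus species subsp. alpha (L.) Auth. x beta Smith var. gamma; "
    "genus2 species2 subsp. delta var. epsilon",
    "genus species subsp. alpha (L.) Auth. var gamma",
    "genus species subsp. alpha var. gamma",
    "genus species subsp. alpha",
]

for s in samples:
    print(repr(extract(ORIGINAL, s)), "->", repr(extract(FIXED, s)))
```

With these inputs, the original pattern only picks up an optional part when it immediately follows the first name (the third sample). The fixed pattern also finds it after arbitrary filler. The captured text is returned exactly as it appears in the input, so "var name" comes back without a dot.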