I am trying to extract names in messy strings like the following:
genus species subsp. name […] x name […] var. name; genus2 species2 subsp. name2 var. name2
genus species subsp. name […] x name […] var. name
genus species subsp. name […] var name
genus species subsp. name var. name
genus species subsp. name
Where […] can be a succession of any characters with no regular pattern.
The desired output is:
subsp. name x name var. name
subsp. name x name var. name
subsp. name var. name
subsp. name var. name
subsp. name
My regex looks like this:
(?i).*?\b((?:aff|cf|ssp|subsp|var)[\.\s]+)([a-z-]+).*?(\sx\s+[a-z-]+)?.*?(\svar[\.\s]+[a-z-]+)?.*
Here is a demo.
I'm using the lazy quantifier *? to find the first occurrence of some sort of anchors (e.g. subsp, x and var) in the strings that I can use to match a given pattern.
The problem is that I can't get the regex to work for all instances, because (\sx\s+[a-z-]+)? and (\svar[\.\s]+[a-z-]+)? are optional: the patterns they match don't exist in all the strings.
Is there a simple solution to get around this issue?
You can wrap the optional patterns with optional non-capturing groups to make the necessary capturing groups obligatory and force the regex engine to make at least one attempt to search for the patterns.
That means you need to change all .*?(pattern-to-extract)? patterns to (?:.*?(pattern-to-extract))?. When the capturing group itself is optional, it may match an empty string right after a .*? that consumed nothing, and the engine considers the job done. When the capturing group is obligatory and wrapped in an optional non-capturing group, it is tried at least once, and the initial .*? is expanded as many times as necessary to reach the capturing group pattern (or until the end of the string, in which case the whole optional group is skipped).
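The difference between the two forms can be seen on a reduced case. The following is a minimal sketch in Python (the question does not name a language, so the use of the re module is an assumption, and the sample string and the simplified \w+ patterns are made up for illustration):

```python
import re

# Made-up sample: a name, some filler, then the " x other" part we want.
text = "name [junk] x other"

# Optional capturing group after a lazy wildcard: `.*?` matches nothing,
# the group fails right after "name", and `?` lets it match empty instead.
loose = re.match(r"\w+.*?(\sx\s+\w+)?", text)

# Lazy wildcard moved inside an optional non-capturing group: the capturing
# group is now obligatory within it, so `.*?` keeps expanding until it fits.
wrapped = re.match(r"\w+(?:.*?(\sx\s+\w+))?", text)

print(loose.group(1))    # None
print(wrapped.group(1))  # ' x other'
```

In the first pattern the engine is never forced to look past the position right after "name", which is why the group stays empty.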
Use
(?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)(?:.*?(\sx\s+[a-z-]+))?(?:.*?(\svar[.\s]+[a-z-]+))?.*
Note that dots inside character classes match literal dots, no need to escape them.
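As a quick check of that note, here is a small Python sketch (Python's re module is an assumption, and the sample strings are arbitrary) comparing the escaped and unescaped character classes:

```python
import re

escaped = re.compile(r"[\.\s]+")    # form used in the question
unescaped = re.compile(r"[.\s]+")   # form used in the answer

# Both classes accept exactly the same strings: literal dots and whitespace.
samples = [". ", "..", " ", "a", "x."]
results = [
    (bool(escaped.fullmatch(s)), bool(unescaped.fullmatch(s))) for s in samples
]

print(bool(unescaped.fullmatch(". ")))  # True
print(bool(unescaped.fullmatch("a")))   # False: the dot is not a wildcard here
```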
See the regex demo.
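For completeness, here is one way the pattern above could be applied. This is a sketch assuming Python's re module; the helper name extract_rank_names and the "[filler]" text standing in for […] are hypothetical. Note that the captured groups are joined as matched, so "var name" is not normalized to "var. name".

```python
import re

# The pattern from the answer, split over several lines for readability.
PATTERN = re.compile(
    r"(?i).*?\b((?:aff|cf|ssp|subsp|var)[.\s]+)([a-z-]+)"
    r"(?:.*?(\sx\s+[a-z-]+))?"
    r"(?:.*?(\svar[.\s]+[a-z-]+))?.*"
)


def extract_rank_names(text):
    """Join the captured pieces; return None if no anchor is found."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Groups 3 and 4 are None when the optional parts are absent.
    return "".join(g for g in m.groups() if g is not None)


# "[filler]" is a made-up stand-in for the irregular text shown as […].
samples = [
    "genus species subsp. name [filler] x name [filler] var. name; "
    "genus2 species2 subsp. name2 var. name2",
    "genus species subsp. name [filler] x name [filler] var. name",
    "genus species subsp. name [filler] var name",
    "genus species subsp. name var. name",
    "genus species subsp. name",
]

for s in samples:
    print(extract_rank_names(s))
```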