regex, grouping

Regex: match two strings based on a shared substring


There are 4 strings, as shown below:

ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv 

Two strings are considered a match when they share the same substring value (VALUEABC or VALUEDEF in the strings above). I therefore want to match the first 2 (both containing VALUEABC) and then the next 2 (both containing VALUEDEF). Matching strings are identified by one regex group returning the same value for both.

What I have tried so far:

ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv

This returns "VALUEABC" in group 1 (from (.*[^QUERY_answer])) for the first 2 strings and "VALUEDEF" for the next 2, so the desired matching is achieved.

The problem with the above regex is that as soon as the value ends with any of the characters in "QUERY_answer", the regex fails to match at all. For instance, the 2 strings below don't match, because VALUESTU ends with "U":

ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
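
The behaviour described above can be reproduced with a short sketch. Python's `re` module is an assumption here (the question does not name a regex flavor), and `key_of` is a hypothetical helper introduced only for illustration.

```python
import re

# First attempt from the question, copied unchanged.
attempt = re.compile(r"ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv")

def key_of(name):
    """Return group 1 for a file name, or None if the pattern does not match."""
    m = attempt.search(name)
    return m.group(1) if m else None

names = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

# The VALUEABC pair shares a key; the VALUESTU pair yields None because
# "U" is one of the characters excluded by [^QUERY_answer].
results = {name: key_of(name) for name in names}
for name, key in results.items():
    print(name, "->", key)
```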

I tried to use Negative Lookahead:

ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv

In this case, however, group 1 is returned as "VALUESTU" for the first string and "VALUESTUQUERY_answer" for the second, so the 2 strings no longer share a value and are left unmatched.

Is there any way to achieve the desired matching?
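
A minimal sketch of the lookahead attempt, again assuming Python's `re` module (the question does not name a regex flavor). The greedy `.*` first consumes as much as it can, and the lookahead is only checked at the position where `.*` stops, so it does not prevent `.*` from running past "QUERY_answer".

```python
import re

# Negative-lookahead attempt from the question, copied unchanged.
lookahead_attempt = re.compile(
    r"ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv"
)

plain = "ABC_FIXED_20220720_VALUESTU.csv"
suffixed = "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv"

# Both names match, but group 1 differs between them.
plain_key = lookahead_attempt.search(plain).group(1)
suffixed_key = lookahead_attempt.search(suffixed).group(1)

print(plain_key)
print(suffixed_key)
```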


Solution

  • You need

    ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
    


    Note

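    • .*[^QUERY_answer] matches any zero or more chars other than line break chars, as many as possible, and then any one char other than Q, U, E, R, Y, _, a, n, s, w, e or r, i.e. any char not listed in the negated character class. A negated character class excludes single characters, not the string QUERY_answer as a whole. It is replaced with .*?, which matches any zero or more chars other than line break chars, as few as possible.
    • (?:QUERY_answer)? - an optional non-capturing group that consumes the QUERY_answer suffix when it is present. Keeping it non-capturing leaves group 1 as the only capturing group.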
    • \.csv - the . is escaped to match a literal dot.
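
As a rough illustration of how the suggested pattern pairs the files, here is a sketch in Python. The choice of Python, the use of `fullmatch` to anchor the whole name, and the helper name `group_by_key` are all assumptions made for this example and are not part of the original answer.

```python
import re
from collections import defaultdict

# Pattern from the answer: lazy group 1, optional suffix, literal ".csv".
PATTERN = re.compile(r"ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv")

def group_by_key(filenames):
    """Bucket file names by the value captured in group 1."""
    groups = defaultdict(list)
    for name in filenames:
        m = PATTERN.fullmatch(name)  # fullmatch: the whole name must match
        if m:
            groups[m.group(1)].append(name)
    return dict(groups)

names = [
    "ABC_FIXED_20220720_VALUEABC.csv",
    "ABC_FIXED_20220720_VALUEABCQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUEDEF.csv",
    "ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv",
    "ABC_FIXED_20220720_VALUESTU.csv",
    "ABC_FIXED_20220720_VALUESTUQUERY_answer.csv",
]

pairs = group_by_key(names)
for key, members in pairs.items():
    print(key, members)
```

Because `.*?` is lazy, it stops at the first position from which the rest of the pattern can match, so the optional `(?:QUERY_answer)?` gets the chance to consume the suffix before group 1 can grow into it.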