Search code examples
sqlregexposixdb2-400

Regex sub-group behaviour with and without space


Say the task were to append the last numbers in a product code to itself with a hyphen between the original and added numbers (purely for experimentation).

I would like to understand why including a space is necessary in the following example:

with foo ( prod )                         
as ( values ('MYPRODUCT 123'))            
select                                    
 'dot aster space' as test_type,          
 '''(.* (\d+))'',''$1-$2''' as the_regex, 
 regexp_replace(prod,'(.* (\d+))','$1-$2')
from foo                                  
 UNION ALL                                
select                                    
 'dot aster no space',                    
 '''(.*(\d+))'',''$1-$2''',               
 regexp_replace(prod,'(.*(\d+))','$1-$2') 
from foo                                  

Result

TEST_TYPE           THE_REGEX             REGEXP_REPLACE   
dot aster space     '(.* (\d+))','$1-$2'  MYPRODUCT 123-123
dot aster no space  '(.*(\d+))','$1-$2'   MYPRODUCT 123-3  

I would have expected that, since the period matches any character, including a blank space, the two regexes would have the same result.

However, even accepting that they do not, I can't figure out why only the last 3 is captured in the second group.

Thanks.


Solution

  • It's a matter of greediness.

    With the regex

    '(.* (\d+))'

    you ask explicitely for a space before the digits, so \d+ will get the 3 digits.

    With the regex

    '(.*(\d+))'

    the dot .* will take as many characters as it can before matching a digit or more. So .* will match 'MYPRODUCT 12' and \d+ will match '3'.

    Solution : the non-greedy quantifier '?'. The regex would be

    '(.*?(\d+))'

    and it will match a maximum digits for \d+, then the remainder for .*