
Why does re.sub() keep the unmatched parts of the string by default in Python?


import re

print(re.sub(
    r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)',
    r'\2x\3x\4',
    'START, size: 100Х200 x 50, END'
))

Output:

START, 100x200x50, END

I did not expect the result to include the parts of the string that the regular expression does not match. Everything outside the match should be omitted.

Yes, it works as I expect if the pattern covers the whole string, by adding .* at the beginning and the end:

r'.*(size:)\D+(\d+)\D+(\d+)\D+(\d+).*'

The output will be:

100x200x50

...which is how I expected it to behave by default.

Why? =)

UPDATE

Yes, it is obvious: re.sub() looks for a match and replaces only that match. (Working at midnight did not make it obvious =D)

But it would be great to have a solution that avoids matching the whole string.
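
To make the behaviour described in the update concrete, here is a minimal sketch that rebuilds by hand what re.sub() does for a single match. It uses only the pattern and string from the question; the variable names (m, matched, rebuilt) are just illustrative.

```python
import re

pattern = r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)'
text = 'START, size: 100Х200 x 50, END'

# re.search finds the same first match that re.sub would replace
m = re.search(pattern, text)
start, end = m.span()
matched = text[start:end]

# Keep everything before and after the match, replace only the match itself
rebuilt = text[:start] + m.expand(r'\2x\3x\4') + text[end:]

print(repr(matched))   # the only part of the string that gets replaced
print(rebuilt)
print(rebuilt == re.sub(pattern, r'\2x\3x\4', text))
```

The text before and after the matched span is copied through untouched, which is why 'START, ' and ', END' survive in the result.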


Solution

  • You seem to misunderstand what sub does. It substitutes the text matched by the regex. The regex r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)' matches only part of your string, so ONLY THE MATCHING PART is substituted; the capture groups do not affect this. What you can do (if you don't want to add .* at the beginning and the end) is use re.findall like this:

    re.findall(
        r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)',
        'START, size: 100Х200 x 50, END'
        )
    

    which will return [('size:', '100', '200', '50')]. You can then format it as you wish. One way to do it, as a one-liner with no error handling, is this:

    '{1}x{2}x{3}'.format(
        *re.findall(
            r'(size:)\D+(\d+)\D+(\d+)\D+(\d+)',
            'START, size: 100Х200 x 50, END')[0]
        )
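
    The one-liner above raises IndexError when nothing matches. As a possible alternative, here is a small sketch using re.search with a check for the no-match case. The function name extract_size is hypothetical, and the 'size:' group is left uncaptured since it is not used in the output.

```python
import re

def extract_size(text):
    """Return the three numbers after 'size:' joined by 'x', or None if absent."""
    m = re.search(r'size:\D+(\d+)\D+(\d+)\D+(\d+)', text)
    if m is None:
        return None  # no size entry found; the caller decides what to do
    return 'x'.join(m.groups())

print(extract_size('START, size: 100Х200 x 50, END'))
print(extract_size('no dimensions here'))
```

    Because re.search returns only the match object, nothing outside the matched span ever reaches the result, so no leading or trailing .* is needed.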