python, regex, python-re

Matching repeating words in a row by regex


I would like to find and replace repeated words in a string, but only if they are directly adjacent or separated by a space. For example:

"<number> <number>" -> "<number>"
"<number><number>"-> "<number>"

but not

"<number> test <number>" -> "<number> test <number>"

I have tried this:

import re
re.sub(f"(.+)(?=\<number>+)","", label).strip()

but it gives the wrong result for the last test case.

Could you please help me with that?


Solution

  • You can use

    re.sub(r"(<number>)(?:\s*<number>)+", r"\1", label).strip()
    

    Details:

    • (<number>) - Group 1: a <number> string
    • (?:\s*<number>)+ - one or more occurrences of the following sequence of patterns:
      • \s* - zero or more whitespaces
      • <number> - a <number> string

    The \1 is the replacement backreference to the Group 1 value.
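The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace inside the pattern and allows inline comments. This is only a sketch of the same regex; the name REPEATED_NUMBER is made up for illustration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# REPEATED_NUMBER is a hypothetical name, not part of the original answer.
REPEATED_NUMBER = re.compile(
    r"""
    (<number>)      # Group 1: the first <number> token
    (?:             # non-capturing group, repeated one or more times:
        \s*         #   zero or more whitespace characters
        <number>    #   another <number> token
    )+
    """,
    re.VERBOSE,
)

# \1 puts back the single token captured by Group 1.
print(REPEATED_NUMBER.sub(r"\1", "<number> <number>"))
print(REPEATED_NUMBER.sub(r"\1", "<number> test <number>"))
```

In the second call nothing is replaced, because "test" sits between the two tokens and \s* only allows whitespace there.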

    Python test:

    import re
    text = '"<number> <number>", "<number><number>", not "<number> test <number>"'
    print( re.sub(r"(<number>)(?:\s*<number>)+", r'\1', text) )
    # => "<number>", "<number>", not "<number> test <number>"
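If the placeholder is not always <number>, the same idea could be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name collapse_repeats and its token parameter are assumptions, and re.escape is used so that a token containing regex metacharacters is matched literally.

```python
import re

def collapse_repeats(text, token="<number>"):
    """Collapse runs of `token` (adjacent or whitespace-separated) into one.

    Hypothetical helper: generalizes the pattern above to any literal token.
    """
    tok = re.escape(token)  # treat the token as literal text, not as a regex
    pattern = re.compile(rf"({tok})(?:\s*{tok})+")
    return pattern.sub(r"\1", text)

print(collapse_repeats("<number> <number>"))
print(collapse_repeats("<number><number><number>"))
print(collapse_repeats("<number> test <number>"))
```

The first two calls should collapse to a single <number>, while the third is left unchanged because another word separates the two tokens.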