Tags: python, python-3.x, regex, python-re

Regex (Python) - How to match part of a string multiple times?


I need a Python regex that matches part of a string multiple times:

My String: aaaa-bb-ccc-dd
My Pattern: ([A-z]+)\-([A-z]+)

I would like to have groups like this:
1: aaaa-bb
2: bb-ccc
3: ccc-dd

Does anybody have an idea how to do this? If it cannot be done with a regex alone, a solution using a Python for loop is also very welcome.


Solution

  • You can use a lookahead to get overlapping matches:

    (?=\b([A-Za-z]+-[A-Za-z]+)\b)
    


    Details:

    • (?= - start of a positive lookahead that matches a location that is immediately followed by
      • \b - a word boundary
      • ([A-Za-z]+-[A-Za-z]+) - Group 1: one or more ASCII letters, -, one or more ASCII letters
      • \b - a word boundary
    • ) - end of the lookahead.
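
To see why both the lookahead and the word boundaries matter, here is a minimal sketch that compares three patterns on the string from the question. The variable names (consuming, no_boundary, overlapping) are only illustrative and not part of the original answer.

```python
import re

text = "aaaa-bb-ccc-dd"

# No lookahead: each match consumes its characters, so the search resumes
# after "aaaa-bb" and the overlapping "bb-ccc" is never found.
consuming = re.findall(r'\b([A-Za-z]+-[A-Za-z]+)\b', text)

# Lookahead without \b: the match itself is empty, so the engine retries at
# every position and also captures partial words such as "aaa-bb".
no_boundary = re.findall(r'(?=([A-Za-z]+-[A-Za-z]+))', text)

# Lookahead with \b on both sides: only whole-word pairs are captured.
overlapping = re.findall(r'(?=\b([A-Za-z]+-[A-Za-z]+)\b)', text)

print(consuming)    # ['aaaa-bb', 'ccc-dd']
print(no_boundary)  # ['aaaa-bb', 'aaa-bb', 'aa-bb', 'a-bb', 'bb-ccc', 'b-ccc', 'ccc-dd', 'cc-dd', 'c-dd']
print(overlapping)  # ['aaaa-bb', 'bb-ccc', 'ccc-dd']
```

Because the lookahead is zero-width, re.findall returns only the contents of Group 1 for each position where the lookahead succeeds.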

    In Python, use it with re.findall:

    import re
    text = "aaaa-bb-ccc-dd"
    print( re.findall(r'(?=\b([A-Z]+-[A-Z]+)\b)', text, re.I) )
    # => ['aaaa-bb', 'bb-ccc', 'ccc-dd']
    

    Note that I changed [A-Za-z] to [A-Z] in the code because the re.I option makes the regex case insensitive. Make sure you use the r string literal prefix; otherwise \b is treated as a BACKSPACE character (\x08) rather than a word boundary.
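
The effect of the r prefix and of re.I can be checked directly. This is a small sketch using the same string; the variable names are illustrative.

```python
import re

text = "aaaa-bb-ccc-dd"

# In a normal string literal, "\b" is a single BACKSPACE character (\x08).
print("\b" == "\x08", len("\b"))   # True 1
# In a raw string literal, r"\b" is a backslash followed by the letter b.
print(len(r"\b"))                  # 2

# Raw string + re.I: \b reaches the regex engine as a word boundary.
with_raw = re.findall(r'(?=\b([A-Z]+-[A-Z]+)\b)', text, re.I)

# No r prefix: the pattern contains literal backspaces, which the text lacks.
without_raw = re.findall('(?=\b([A-Z]+-[A-Z]+)\b)', text, re.I)

# Raw string but no re.I: [A-Z] does not match the lowercase letters.
case_sensitive = re.findall(r'(?=\b([A-Z]+-[A-Z]+)\b)', text)

print(with_raw)        # ['aaaa-bb', 'bb-ccc', 'ccc-dd']
print(without_raw)     # []
print(case_sensitive)  # []
```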

    Variations

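    • (?=\b([^\W\d_]+-[^\W\d_]+)\b) - matches any Unicode letters ([^\W\d_] is a word character that is neither a digit nor an underscore)
    • (?=(?<![^\W\d_])([^\W\d_]+-[^\W\d_]+)(?![^\W\d_])) - matches any Unicode letters, with any non-letter character (or the start/end of the string) acting as a boundary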
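
As a rough illustration of how the variations differ, the sketch below runs them on two made-up sample strings (both are assumptions chosen for the demonstration, as are the variable names). It assumes Python 3 str patterns, where \w and \b are Unicode-aware by default.

```python
import re

ascii_only = r'(?=\b([A-Za-z]+-[A-Za-z]+)\b)'
unicode_letters = r'(?=\b([^\W\d_]+-[^\W\d_]+)\b)'
letter_bounds = r'(?=(?<![^\W\d_])([^\W\d_]+-[^\W\d_]+)(?![^\W\d_]))'

# Hypothetical sample with non-ASCII letters.
words = "альфа-бета-gamma"
ascii_words = re.findall(ascii_only, words)
unicode_words = re.findall(unicode_letters, words)
print(ascii_words)    # []
print(unicode_words)  # ['альфа-бета', 'бета-gamma']

# Hypothetical sample where digits touch the letters: \b sees no boundary
# between a letter and a digit, but the lookarounds treat digits as non-letters.
mixed = "ab-cd1 2ef-gh"
mixed_word_boundary = re.findall(unicode_letters, mixed)
mixed_letter_bounds = re.findall(letter_bounds, mixed)
print(mixed_word_boundary)  # []
print(mixed_letter_bounds)  # ['ab-cd', 'ef-gh']
```

Which variation fits depends on whether digits and underscores next to a word should block a match or simply be ignored.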