Python regex for sequence containing at least two digits/letters


Using the Python module re, I would like to detect sequences that contain at least two letters (A-Z) and at least two digits (0-9) in a text. For example, in the text

"N03FZ467 other text N03671"

only the substring "N03FZ467" should be matched.

The best I have got so far is

(?=[A-Z]*\d)[A-Z0-9]{4,}

which detects sequences of length at least 4 that contain only letters A-Z and digits 0-9, and at least one digit and one letter. How can I require at least two of each?


Solution

    1. If you want to match full words, start matching at word boundaries \b.
    2. Check the first condition (two uppercase letters) with a lookahead: (?=(?:\d*[A-Z]){2})
    3. If this succeeds, match the second requirement, two digits: (?:[A-Z]*\d){2}
    4. Finally match any remaining [A-Z\d]* until another \b.

    Putting it together:

    \b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b
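
As a quick check, here is a minimal Python sketch that applies the pattern above to the sample text from the question. The names PATTERN, text and matches are only illustrative.

```python
import re

# Word boundary, lookahead for two letters, then two digits, then the rest.
PATTERN = re.compile(r"\b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b")

text = "N03FZ467 other text N03671"

# The pattern has no capturing groups, so findall returns the full matches.
matches = PATTERN.findall(text)
print(matches)  # ['N03FZ467']
```

"N03671" is rejected because it contains only one letter, so the lookahead for two letters fails.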
    

    See this demo at regex101 or a Python demo at tio.run

    Note that a lookahead is a zero-length assertion: it does not consume characters. If you don't specify a starting point such as \b, the lookahead is attempted at every position, which is less efficient.
    Also, the minimum length of four is already implied by the requirements (two letters plus two digits).
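
The two notes above can be illustrated with a small sketch. ANCHORED is the pattern from this answer; UNANCHORED is a hypothetical variant with both \b removed, used here only for comparison. Besides the efficiency point, dropping \b also changes what is matched, since a sequence embedded in a longer word is then found.

```python
import re

ANCHORED = re.compile(r"\b(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*\b")
# Hypothetical variant without word boundaries, for comparison only.
UNANCHORED = re.compile(r"(?=(?:\d*[A-Z]){2})(?:[A-Z]*\d){2}[A-Z\d]*")

# The sequence is glued to a lowercase letter, so no \b precedes "N".
embedded = "xN03FZ467"
anchored_hits = ANCHORED.findall(embedded)
unanchored_hits = UNANCHORED.findall(embedded)
print(anchored_hits)    # []
print(unanchored_hits)  # ['N03FZ467']

# Two letters plus two digits imply a minimum length of four.
results = {s: bool(ANCHORED.fullmatch(s)) for s in ["AB12", "A1B2", "AB1", "A12"]}
print(results)
```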