python regex string pandas contains

Find specific sequence of characters combining number and letters using regex in Python pandas


I am trying to find all rows in a pandas DataFrame for which the column col takes values of the format 1234-XX-YYY, where XX is a placeholder for any two capital letters (A-Z) and YYY is a placeholder for any three digits (0-9).

Here is my code so far:

df[df['col'].str.contains('^1234-\[A-Z]{2}\[d]{3}', na=False)]

How can I achieve the desired result?

Solution

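  • When you escape an opening [ (as in \[), you tell the regex engine to match it as a literal character, so \[A-Z] no longer defines a character class. Likewise, [d] matches only the letter d; use [0-9] (or \d) for a digit. If you expect a - to appear at some place in the string, you need to add it to the pattern. Also, if you expect uppercase letters to appear, you need A-Z, not a-z.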

    Use

    ^1234-[A-Z]{2}-[0-9]{3}$
    

    Details

    • ^ - start of string
    • 1234- - a literal string
    • [A-Z]{2} - two uppercase letters
    • - - a hyphen
    • [0-9]{3} - three digits
    • $ - end of string.
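
To see the difference in practice, here is a minimal sketch that applies both the original and the corrected pattern to a small DataFrame. The sample values are made up for illustration; only the column name col and the two patterns come from the question and the answer above.

```python
import pandas as pd

# Hypothetical sample data; only the column name `col` comes from the question.
df = pd.DataFrame({
    "col": [
        "1234-AB-567",    # valid: two uppercase letters, hyphen, three digits
        "1234-ab-567",    # lowercase letters
        "1234-AB567",     # missing second hyphen
        "1234-AB-5678",   # four digits, rejected by the $ anchor
        "x1234-AB-567",   # extra leading character, rejected by the ^ anchor
        None,             # missing value
    ]
})

# Pattern from the question: \[ is a literal bracket and [d] is the letter d,
# so it would only match text such as "1234-[A-Z]][d]]]".
original = r"^1234-\[A-Z]{2}\[d]{3}"
original_mask = df["col"].str.contains(original, na=False)

# Corrected pattern from the answer.
pattern = r"^1234-[A-Z]{2}-[0-9]{3}$"

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = df["col"].str.contains(pattern, na=False)
matches = df[mask]

print(original_mask.any())
print(matches)
```

With this sample data the original pattern selects no rows, while the corrected one keeps only the row containing 1234-AB-567. If the whole value must always match, Series.str.fullmatch may be an alternative to writing the ^ and $ anchors by hand.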