python, regex, string, data-extraction

Regular expression for extracting Russian passport numbers


I need a regular expression that extracts the passport number following the specific word паспорт.

Possible options are:

  • паспорт 5715 424141
  • паспорт 5715-424141
  • паспорт 5715 - 424141

I need to extract the first 4 and the last 6 digits that follow the word паспорт, so the result should be 5715 and 424141.

I tried ^(\d{4})\ (\d{6})$ but it does not match any of these strings.


Solution

  • For starters, ^ anchors the match to the start of the string, so your pattern fails on these inputs: they start with "паспорт", not with a digit.

    It also seems that the - between the two groups of digits is optional and may be surrounded by spaces, which your pattern does not allow for.

    To fix all those issues, use:

    паспорт (\d{4})\s*-?\s*(\d{6})
    
    • паспорт - literal match.
    • (\d{4}) - a capture group of four digits.
    • \s* - any number of whitespace characters, including zero.
    • -? - an optional dash.
    • \s* - any number of whitespace characters, including zero.
    • (\d{6}) - a capture group of six digits.

    And since you tagged with Python:

    import re
    
    s = """паспорт 5715 424141
    паспорт 5715-424141
    паспорт 5715 - 424141"""
    
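    for line in s.splitlines():
        # Every sample line matches, so .groups() returns both captured digit runs.
        print(re.search(r"паспорт (\d{4})\s*-?\s*(\d{6})", line).groups())
    # Prints ('5715', '424141') three times, once per line.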
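
If the number is embedded in longer text, or may be absent altogether, a slightly more defensive variant could look like the sketch below. It is not part of the original answer: the function name extract_passports and the sample sentence are hypothetical, and the use of re.IGNORECASE, \s+ after the keyword, and the trailing (?!\d) check are assumptions about what the input might look like. re.VERBOSE is used only so that the pattern can carry the same breakdown as the list above.

```python
import re

# Verbose form of the pattern; whitespace and comments inside it are ignored.
# Assumptions (not in the original answer): the keyword may be capitalised,
# and it may be followed by more than one whitespace character.
PASSPORT_RE = re.compile(
    r"""
    паспорт   # literal keyword
    \s+       # one or more whitespace characters after the keyword
    (\d{4})   # capture group: first four digits
    \s*-?\s*  # optional dash, optionally surrounded by whitespace
    (\d{6})   # capture group: last six digits
    (?!\d)    # the six digits must not be followed by another digit
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_passports(text):
    """Return a list of (first4, last6) tuples for every match in text."""
    return [m.groups() for m in PASSPORT_RE.finditer(text)]


# Hypothetical sample text containing two passport numbers.
sample = "Иванов, паспорт 5715 - 424141, выдан в 2015; Паспорт 4010-123456."
print(extract_passports(sample))              # [('5715', '424141'), ('4010', '123456')]
print(extract_passports("no passport here"))  # []
```

Using finditer instead of search avoids calling .groups() on None when a string contains no match, and returns every occurrence rather than only the first.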
    
