python regex string extract iban

Avoid extracting IBAN number from string


I am trying to avoid extracting the IBAN number from my string.

Example:

import re


def get_umsatzsteuer_identifikationsnummer(string):
    # Demo --> https://regex101.com/r/VHaS7Y/1

    # "DE" followed by 12 digits/spaces, by 9 digits, or by a space and 9 digits
    reg = r'DE[0-9 ]{12}|DE[0-9]{9}|DE [0-9]{9}'
    pattern = re.compile(reg)
    matched_words = pattern.findall(string)

    return matched_words


string = ("I want to get this DE813992525 and this DE813992526 number and this "
          "number DE 813 992 526 and this number  DE 813992526. I do not want the bank "
          "account number: IBAN DE06300501100011054517.")

get_umsatzsteuer_identifikationsnummer(string)


>>>>> ['DE813992525',
 'DE813992526',
 'DE 813 992 526',
 'DE 813992526',
 'DE063005011000']

The last item in the result is the first part of the German IBAN, which I don't want to extract. How can I avoid it?


Solution

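  • You can shorten the alternation by making the space optional: DE ?[0-9]{9} covers both DE[0-9]{9} and DE [0-9]{9}. To exclude the IBAN while still matching the number that is followed by a dot, add the negative lookahead (?!\d), which asserts that the match is not directly followed by a digit. The leading \b word boundary prevents a match from starting in the middle of a word.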

    \b(?:DE[0-9 ]{12}|DE ?[0-9]{9})(?!\d)
    


    You can also make the pattern more precise for the third example by matching three groups of three digits, each preceded by a space, because [0-9 ]{12} would also match 12 consecutive spaces.

    \b(?:DE(?: \d{3}){3}|DE ?[0-9]{9})(?!\d)
    

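As a minimal sketch of how the second pattern could be dropped into the function from the question: the module-level name VAT_ID_PATTERN and the variable text are illustrative choices, not part of the original code, and the function name is kept from the question.

```python
import re

# Second pattern from the answer: either three space-separated groups of
# three digits, or nine consecutive digits with an optional space after "DE".
# The lookahead (?!\d) rejects a match that is directly followed by a digit.
# VAT_ID_PATTERN is a hypothetical name chosen for this sketch.
VAT_ID_PATTERN = re.compile(r'\b(?:DE(?: \d{3}){3}|DE ?[0-9]{9})(?!\d)')


def get_umsatzsteuer_identifikationsnummer(string):
    """Return every substring that looks like a German VAT ID."""
    return VAT_ID_PATTERN.findall(string)


text = (
    "I want to get this DE813992525 and this DE813992526 number and this "
    "number DE 813 992 526 and this number  DE 813992526. I do not want the bank "
    "account number: IBAN DE06300501100011054517."
)

print(get_umsatzsteuer_identifikationsnummer(text))
# ['DE813992525', 'DE813992526', 'DE 813 992 526', 'DE 813992526']

# "DE" followed by 12 spaces is matched by the first pattern from the
# answer, but not by the stricter one used above.
print(get_umsatzsteuer_identifikationsnummer("DE" + " " * 12))
# []
```

Because the pattern contains only non-capturing groups, findall returns the full matched strings. Note that the lookahead only filters on what follows the match, so this sketch still assumes that a VAT ID in the input is never directly followed by another digit.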