Tags: python, python-re

Remove a duplicated pattern (Dutch house number and postal code) with regex


I would like to remove duplicates with a regex in Python, but I'm struggling a bit.

import re

text = 'H. Gerhardstraat 77 1502 CC 77 1502 CC Zaandam'
text = re.sub(r'\b(\d{,4})\s(\d{4})\s([A-Za-z]){2}\b', r'\1', text)

print(text)

I would like to get: 'H. Gerhardstraat 77 1502 CC Zaandam'

Instead, I now get: 'H. Gerhardstraat 77 77 Zaandam'


Solution

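  • Use the fourth argument, count, of re.sub(pattern, repl, string, count=0, flags=0) to limit how many matches are replaced. Count the matches first, then remove all but the last one, as follows: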

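    import re

    text = 'H. Gerhardstraat 77 1502 CC 77 1502 CC Zaandam'
    # House number, four-digit postal code, two letters, trailing whitespace
    pattern = r'(\d{0,4}\s\d{4}\s[A-Za-z]{2}\s+)'
    count = len(re.findall(pattern, text))

    # Remove every occurrence except the last one
    if count > 1:
        text = re.sub(pattern, '', text, count=count - 1)

    print(text)  # H. Gerhardstraat 77 1502 CC Zaandam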
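
The count-based approach above removes every number and postal code that matches the pattern except the last one, even if the occurrences differ from each other. If only immediately repeated, identical occurrences should be collapsed, a backreference may be a better fit. The following is a minimal sketch under that assumption; the function name collapse_repeated_address is hypothetical, and the house number is tightened to \d{1,4} (at least one digit), which is also an assumption not stated in the question.

```python
import re

# Group 1 captures "house number, postal code digits, postal code letters".
# (?:\s+\1)+ then matches one or more immediate repeats of that exact text.
REPEATED_ADDRESS = re.compile(r'\b(\d{1,4}\s\d{4}\s[A-Za-z]{2})(?:\s+\1)+\b')


def collapse_repeated_address(text):
    """Collapse consecutive identical 'number postcode' runs into one copy."""
    return REPEATED_ADDRESS.sub(r'\1', text)


print(collapse_repeated_address('H. Gerhardstraat 77 1502 CC 77 1502 CC Zaandam'))
# H. Gerhardstraat 77 1502 CC Zaandam
```

Because the replacement only fires when the same text repeats, a string that contains two different number and postal code pairs is left unchanged.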