Tags: python, regex, regular-language

Python parse inconsistent free form street names using regex


For data in the following structure I want to obtain the parsed street name details:

# streetname 1() refers to house number 1 with an empty () additional qualifier 

keyword_token: street name 4()
keyword_token: street-name 14()

keyword_token: streetname 123()keyword_token: streetname 123()
# why is it logged one message per line, but we get the address logged twice - sometimes??

keyword_token: streetname 9(7)keyword_token: streetname 9(7)
keyword_token: streetname 27()\r\n a lot more text and log messages in the free form text log - one message per line  \n
    
keyword_token: street-name 1-23(BLOCK D HAUS 6)keyword_token: street-name 1-23(BLOCK H HAUS 2)keyword_token: street-name 1-23(BLOCK G HAUS 3)
        
        

The ideal result is three fields for each record:

  • street name
  • house number
  • additional qualifier (empty/NaN if it is empty or missing)

So far I have experimented with the regex keyword_token(.*), but this returns the whole rest of the line after the keyword token.
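For reference, a minimal reproduction of that attempt, using one of the sample lines above (the variable names are only illustrative):

```python
import re

# One of the sample lines in which the same record is logged twice.
line = "keyword_token: streetname 9(7)keyword_token: streetname 9(7)"

# The greedy .* runs to the end of the line, so the capture includes
# the colon, the first record and the repeated record as well.
attempt = re.search(r"keyword_token(.*)", line)
captured = attempt.group(1)
print(captured)
```

The capture is a single string with no separation into street name, house number and qualifier.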

Complications:

  • I am only interested in the first match per line (not all of them), i.e. only the first occurrence of keyword_token:
  • the street name itself can be quite inconsistent (spaces, hyphens); it starts after keyword_token: and the whole address runs until the (

edit: an example regex101 is found here https://regex101.com/r/ueEfNU/1

edit 2: non-numeric house numbers also need to be supported, e.g.:

keyword_token: street_name 32a()

Solution

  • You can use

    keyword_token:\s*(.*?)\s*(\d[a-zA-Z\d-]*)\(([^()]*)\)
    

    See the regex demo. Details:

    • keyword_token: - a fixed string
    • \s* - zero or more whitespaces
    • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible (due to *? lazy quantifier)
    • \s* - zero or more whitespaces
    • (\d[a-zA-Z\d-]*) - Group 2: a digit and then zero or more letters, digits or - char
    • \( - a ( char
    • ([^()]*) - Group 3: zero or more chars other than ( and ) (so an empty () qualifier still matches)
    • \) - a ) char.
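
As a rough sketch of how the pattern could be applied in Python: re.search stops at the first match, which covers the "first occurrence only" requirement. The helper name parse_address and the choice of None for an empty qualifier are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above, compiled once for reuse.
PATTERN = re.compile(r"keyword_token:\s*(.*?)\s*(\d[a-zA-Z\d-]*)\(([^()]*)\)")

def parse_address(line):
    """Return (street, house_number, qualifier) for the first match, or None.

    Hypothetical helper: an empty () qualifier is mapped to None so that
    it can be treated as missing downstream.
    """
    match = PATTERN.search(line)  # search() only returns the first match
    if match is None:
        return None
    street, number, qualifier = match.groups()
    return street, number, qualifier or None

samples = [
    "keyword_token: street name 4()",
    "keyword_token: streetname 9(7)keyword_token: streetname 9(7)",
    "keyword_token: street-name 1-23(BLOCK D HAUS 6)keyword_token: street-name 1-23(BLOCK H HAUS 2)",
    "keyword_token: street_name 32a()",
]
for sample in samples:
    print(parse_address(sample))
```

Lines without a keyword_token: record yield None, so the caller can decide whether to skip them or record a missing value.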