
Regex to extract words except optional final word


I need a regex to extract building names from a list. I'm passing the text and a regex to a framework that does the parsing, so I really want to try to solve this with a regex, not code.

The building name is always all caps and preceded by "Building:" then followed by any of (a) a number, (b) the word "UNIT" in all caps, or (c) any mixed case word. Thus I want to get BUILDING ONE as the result from all of the following except the last row, which should return nothing:

Building: BUILDING ONE 15 [building name followed by unit number]
Optional preceding text Building: BUILDING ONE 15 [preceding text, then building name followed by unit number]
Building: BUILDING ONE UNIT 15 [building name followed by word UNIT and unit number]
Building: BUILDING ONE Floor 2 [building name followed by mixed case word]
Grounds: OPEN SPACE WEST Section 3 [not a building - return nothing]

I feel like I know this, but I'm having a mental block. The closest I have right now is ^.*Building:\s([A-Z+\s*]*).* which, for the samples above, returns

BUILDING ONE
BUILDING ONE
BUILDING ONE UNIT
BUILDING ONE F

The application doing the parsing is written in Python, but as mentioned above, I'm just passing in the regex and data.


Solution

  • You could use this regex:

    (?<=Building: )[A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*
    

    This matches:

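    • (?<=Building: ) : positive lookbehind; the match must be immediately preceded by "Building: " (including the trailing space)
    • [A-Z]+ : an uppercase word
    • (?: (?!UNIT\b)[A-Z]+\b)* : zero or more further uppercase words, each preceded by a space, that are not UNIT; the trailing \b stops the match from taking just the capital first letter of a mixed-case word such as Floor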
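
Since the parsing application is written in Python, here is a minimal sketch of how the pattern behaves with the standard re module on the sample rows from the question. The helper name extract_building and the use of re.search are assumptions for illustration; the framework may call the regex differently.

```python
import re

# The lookbehind pattern from the answer above.
PATTERN = re.compile(r"(?<=Building: )[A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*")

# Sample rows from the question (bracketed notes omitted).
SAMPLES = [
    "Building: BUILDING ONE 15",
    "Optional preceding text Building: BUILDING ONE 15",
    "Building: BUILDING ONE UNIT 15",
    "Building: BUILDING ONE Floor 2",
    "Grounds: OPEN SPACE WEST Section 3",
]


def extract_building(line):
    """Hypothetical helper: return the building name, or None if there is no match."""
    match = PATTERN.search(line)
    return match.group(0) if match else None


results = [extract_building(line) for line in SAMPLES]
for line, name in zip(SAMPLES, results):
    print(f"{line!r} -> {name!r}")
```

The first four rows give BUILDING ONE and the Grounds row gives None. Python only accepts fixed-width lookbehinds, which "Building: " satisfies.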


    If you're using a flavour of regex that doesn't support lookbehinds, you could use this similar regex, which captures the building name in group 1:

    \bBuilding: ([A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*)
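
As a rough sketch of the capture-group variant in Python, the name is read from group 1 rather than from the whole match. The helper building_name and the second building, BUILDING TWO, are made up for illustration and do not come from the question.

```python
import re

# Capture-group variant from the answer above; the building name is in group 1.
PATTERN = re.compile(r"\bBuilding: ([A-Z]+(?: (?!UNIT\b)[A-Z]+\b)*)")


def building_name(line):
    """Hypothetical helper: return group 1, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


with_unit = building_name("Building: BUILDING ONE UNIT 15")
mixed_case = building_name("Building: BUILDING ONE Floor 2")
no_match = building_name("Grounds: OPEN SPACE WEST Section 3")

# With exactly one capturing group, re.findall returns the group 1 text
# for each match instead of the full match.
text = "Building: BUILDING ONE 15\nBuilding: BUILDING TWO Floor 2"
all_names = PATTERN.findall(text)
print(with_unit, mixed_case, no_match, all_names)
```

If the framework returns the whole match rather than a group, the result would include the "Building: " prefix, so the lookbehind version is the better fit in that case.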
    
