regex, regex-greedy

Regex optional everything separated by space or comma (city, state)


I am trying to extract the street, city, state and zip from a list of addresses that is not well formed. Everything except the street is optional, but the parts always appear in order, so a line can be street, street+city, street+city+state, or street+city+state+zip. Separators are either a comma plus a space, or a space only.

So far, I have

^(?<STREET>.*?)(?<SEPARATOR1>(?: *-{1,2} *)|(?:, ?))(?<CITY>[a-z-' ]*)?((?<SEPARATOR2>(?: )|(?:, ))(?<STATE>AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY))?((?<SEPARATOR3>(?: )|(?:, ))(?<ZIP>[0-9]{5}(-[0-9]{4})?))?

I am having trouble getting any capture after the CITY capture when it is separated from the next part only by a space.
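For reference, the symptom can be reproduced outside regex101 with a short Python sketch. It rests on two assumptions that are not stated in the question: the pattern is run case-insensitively (the city class is lowercase-only while the sample cities are capitalised), and the named groups are rewritten from (?<NAME>...) to Python's (?P<NAME>...) syntax. The names STATES, ORIGINAL and parse_original are illustrative only.

```python
import re

# State alternation copied from the pattern, split across lines for readability.
STATES = (
    "AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|"
    "MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|"
    "SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY"
)

# The pattern from the question, with (?<NAME>...) written as (?P<NAME>...).
# Assumption: re.IGNORECASE stands in for the case-insensitive flag.
ORIGINAL = re.compile(
    r"^(?P<STREET>.*?)(?P<SEPARATOR1>(?: *-{1,2} *)|(?:, ?))(?P<CITY>[a-z-' ]*)?"
    r"((?P<SEPARATOR2>(?: )|(?:, ))(?P<STATE>" + STATES + r"))?"
    r"((?P<SEPARATOR3>(?: )|(?:, ))(?P<ZIP>[0-9]{5}(-[0-9]{4})?))?",
    re.IGNORECASE,
)


def parse_original(line):
    """Return the named groups for one address line, or None if nothing matches."""
    m = ORIGINAL.match(line)
    if m is None:
        return None
    return {name: m.group(name) for name in ("STREET", "CITY", "STATE", "ZIP")}


if __name__ == "__main__":
    for line in ("123 Ave Ave - Hoquiam, WA 98103",
                 "123 Ave Ave - Hoquiam WA 98103"):
        print(repr(line), "->", parse_original(line))
```

Under these assumptions the comma-separated line yields CITY, STATE and ZIP as expected, while the space-separated line puts "Hoquiam WA " into CITY and leaves STATE and ZIP empty.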

Test data:

123 Ave Ave - Hoquiam WA 98103
123 Ave Ave - Hoquiam, WA 98103
123 Ave Ave - Hoquiam, WA 98103-1345
123 Ave Ave - Hoquiam
123 Ave Ave - Ocean Shores WA
123 Ave Ave - Ocean Shores, WA
123 Ave Ave - D'ile, WA
123 Ave Ave

What am I doing wrong?

https://regex101.com/r/v476Gx/1


Solution

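  • The greedy CITY class [a-z-' ]* includes a space, so when only a space follows the city it also consumes the state, leaving nothing for the later optional groups. With some tweaking, the following updated regex should work for you. It makes CITY lazy ([a-z-' ]*?), wraps everything after STREET in one optional group so that a street-only line still matches, and adds a trailing $ so the lazy groups must extend to the end of the line: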

    ^(?<STREET>.*?)(?:(?<SEPARATOR1>(?: *-{1,2} *)|(?:, ?))(?<CITY>[a-z-' ]*?)?((?<SEPARATOR2>(?: )|(?:, ))(?<STATE>AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY))?((?<SEPARATOR3>(?: )|(?:, ))(?<ZIP>[0-9]{5}(?:-[0-9]{4})?))?)?$
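    The updated pattern can be checked against the test data with a small Python sketch. This is only an illustration under assumptions: Python needs (?P<NAME>...) instead of (?<NAME>...), re.IGNORECASE stands in for the case-insensitive flag, each line is matched on its own rather than in multiline mode, and the names STATES, FIXED and parse_address are hypothetical.

```python
import re

# State alternation copied from the pattern, split across lines for readability.
STATES = (
    "AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|"
    "MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|"
    "SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY"
)

# The updated pattern: lazy CITY, one optional group around everything
# after STREET, and a trailing $ anchor.
FIXED = re.compile(
    r"^(?P<STREET>.*?)"
    r"(?:(?P<SEPARATOR1>(?: *-{1,2} *)|(?:, ?))(?P<CITY>[a-z-' ]*?)?"
    r"((?P<SEPARATOR2>(?: )|(?:, ))(?P<STATE>" + STATES + r"))?"
    r"((?P<SEPARATOR3>(?: )|(?:, ))(?P<ZIP>[0-9]{5}(?:-[0-9]{4})?))?)?$",
    re.IGNORECASE,
)


def parse_address(line):
    """Return STREET, CITY, STATE and ZIP for one line (None for absent parts)."""
    m = FIXED.match(line.strip())
    if m is None:
        return None
    return {name: m.group(name) for name in ("STREET", "CITY", "STATE", "ZIP")}


TEST_DATA = [
    "123 Ave Ave - Hoquiam WA 98103",
    "123 Ave Ave - Hoquiam, WA 98103",
    "123 Ave Ave - Hoquiam, WA 98103-1345",
    "123 Ave Ave - Hoquiam",
    "123 Ave Ave - Ocean Shores WA",
    "123 Ave Ave - Ocean Shores, WA",
    "123 Ave Ave - D'ile, WA",
    "123 Ave Ave",
]

if __name__ == "__main__":
    for line in TEST_DATA:
        print(repr(line), "->", parse_address(line))
```

    Because CITY is lazy, it grows one character at a time and stops as soon as the remaining text can be read as an optional state, an optional ZIP and the end of the line, so "Ocean Shores WA" splits into city "Ocean Shores" and state "WA".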
    
