
R (regex) - removing apartment, unit, and other words from end of address


I have a large dataset of addresses that I plan to geocode in ArcGIS (Google geolocating is too expensive). Examples of the addresses are below.

9999 ST PAUL ST BSMT

GARRISON BL & BOARMAN AVENUE REAR

1234 MAIN STREET 123

1234 MAIN ST UNIT1

ArcGIS doesn't recognize addresses that include units and other words at the end, so I want to remove these words so that the addresses look like this:

9999 ST PAUL ST

GARRISON BL & BOARMAN AVENUE

1234 MAIN STREET

1234 MAIN ST

The key challenges include:

  1. ST is used both to abbreviate "STREET" and to indicate "SAINT" in street names.
  2. Addresses end in many different indicators, such as STREET and AVENUE.
  3. There are intersections (indicated with &) that might include indicators like ST and AVENUE twice.

Using R, I'm attempting to apply the sub() function to solve the problem but I have not had success. Below is my latest attempt.

sub("(.*)ST","\\1",df$Address,perl=T)

I know that many existing questions are similar, but none addresses this problem directly, and I suspect it is relevant to other users.


Solution

  • Removing the last word would probably work for you, but to be a little safer you can use this regex, which keeps the part you want and discards the rest:

    (.*(?:ST|AVENUE|STREET)\b).*
    

    Here, .*(?:ST|AVENUE|STREET)\b matches greedily from the start of the string and backtracks only as far as the last occurrence of ST, AVENUE, or STREET that is followed by a word boundary. That part is captured in group 1. The trailing .* consumes whatever comes after it, so replacing the whole match with \1 discards the tail. In your examples only one word follows the street type, but this drops any number of trailing words. An address that contains none of the listed words is left unchanged, so other street types you care about (BL, for example) need to be added to the alternation.

    So instead of this,

    sub("(.*)ST","\\1",df$Address,perl=T)
    

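    try this (note the doubled backslash: inside an R string, "\b" is a backspace character, so the regex word boundary has to be written as "\\b"),

    sub("(.*(?:ST|AVENUE|STREET)\\b).*","\\1",df$Address,perl=T)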
    

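
As a quick check, here is a minimal sketch that runs the pattern on the four sample addresses from the question. The vector name addresses is just a stand-in for df$Address, and the list of street types is only the three used above; real data will likely need more entries.

```r
# Sample addresses copied from the question (stand-in for df$Address).
addresses <- c(
  "9999 ST PAUL ST BSMT",
  "GARRISON BL & BOARMAN AVENUE REAR",
  "1234 MAIN STREET 123",
  "1234 MAIN ST UNIT1"
)

# Keep everything up to the last street-type word and drop what follows.
# "\\b" in the R string is the regex word boundary \b.
pattern <- "(.*(?:ST|AVENUE|STREET)\\b).*"

cleaned <- sub(pattern, "\\1", addresses, perl = TRUE)
print(cleaned)
# [1] "9999 ST PAUL ST"              "GARRISON BL & BOARMAN AVENUE"
# [3] "1234 MAIN STREET"             "1234 MAIN ST"
```

Because the match is greedy, the "ST" in "ST PAUL" is kept and only the text after the last street-type word is removed.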