I have a large dataset of addresses that I plan to geocode in ArcGIS (Google geolocating is too expensive). Examples of the addresses are below.
9999 ST PAUL ST BSMT
GARRISON BL & BOARMAN AVENUE REAR
1234 MAIN STREET 123
1234 MAIN ST UNIT1
ArcGIS doesn't recognize addresses that include units and other words at the end, so I want to remove these words so that the addresses look like the ones below.
9999 ST PAUL ST
GARRISON BL & BOARMAN AVENUE
1234 MAIN STREET
1234 MAIN ST
The key challenges include:
ST is used both to abbreviate streets and to indicate "SAINT" in street names.
Street types are also written out in full, such as STREET and AVENUE.
Some addresses are intersections (joined with &) that might include indicators like ST and AVENUE twice.
Using R, I'm attempting to apply the sub() function to solve the problem, but I have not had success. Below is my latest attempt.
sub("(.*)ST","\\1",df$Address,perl=T)
I know that many existing questions are similar, but none address this problem directly, and I suspect it is relevant to other users.
Removing the last word should probably work for you, but to be a little safer you can use this regex to keep what you want and discard the rest.
(.*(?:ST|AVENUE|STREET)\b).*
Here, .*(?:ST|AVENUE|STREET)\b matches your intended data: the greedy .* consumes as much of the string as it can and then backtracks to the last occurrence of ST, AVENUE, or STREET that is followed by a word boundary. Whatever comes after that is discarded, which is what you wanted. In your current examples only one word follows, but the pattern discards any number of words, or indeed anything else, that occurs after those specific words. The intended data is captured in group 1, so replace the whole match with \1 (written "\\1" in an R string).
So instead of this,
sub("(.*)ST","\\1",df$Address,perl=T)
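try this (the backslash in \b must be doubled inside an R string, otherwise R reads "\b" as a backspace character),
sub("(.*(?:ST|AVENUE|STREET)\\b).*","\\1",df$Address,perl=TRUE)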
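As a quick check, the pattern can be run against the sample addresses from the question. This is only a minimal sketch: the names addresses, pattern, and cleaned are illustrative and not part of the original data frame, and the word boundary is written as \\b because R interprets "\b" inside a string literal as a backspace character.

```r
# Sample addresses copied from the question (illustrative vector, not df$Address)
addresses <- c(
  "9999 ST PAUL ST BSMT",
  "GARRISON BL & BOARMAN AVENUE REAR",
  "1234 MAIN STREET 123",
  "1234 MAIN ST UNIT1"
)

# Greedy .* backtracks to the LAST ST/AVENUE/STREET that ends at a word boundary.
# The backslash is doubled so the regex engine receives \b (word boundary).
pattern <- "(.*(?:ST|AVENUE|STREET)\\b).*"

# Keep capture group 1 and drop everything after it
cleaned <- sub(pattern, "\\1", addresses, perl = TRUE)
print(cleaned)
# [1] "9999 ST PAUL ST"              "GARRISON BL & BOARMAN AVENUE"
# [3] "1234 MAIN STREET"             "1234 MAIN ST"
```

Strings that contain none of the three words are returned unchanged by sub(), so other street types that appear in the data (BL, for instance) would need to be added to the alternation. Because the pattern has no word boundary before ST, it would also treat a word that merely ends in ST as a street indicator.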