I am trying to extract the street names from a street address. I just want to extract the street/house number at the beginning of the string, and extract everything after the last street suffix (RD, ST, DR, HWY, ect).
Two caveats:
If there are two street suffixes, I want to leave the first one, and strip out the second one (For example, South Pkwy Dr becomes South Pkwy, not just South).
If the street address ends in a number, I want to leave the suffix. For example, if the address is 123 County Rd 53, I want it to return County Rd 53. If the street address is State Rte 22, I want it to return State Rte 22.
Below is the sample data (input) and the desired output (output)
d <- tibble(input=c(
'505 BLACKBERRY',
'135 BEARDSLEY ST',
'15 HUNT CLUB DR',
'1223 STATE ROUTE 103',
'455 STATE RTE 43',
'206 COUNTY RD 4710',
'17 E 250TH ST',
'158 BALLINGER AVE SE',
'150 BALLINGER AVE S',
'18 BALLINGER AVE T' ,
'1272 ORANGE SUN TRL',
'291 S MORELAND BLVD',
'615 RUSSET WOOD LN',
'1165 MORROCCO CT',
'1321 S PKWY DR',
'250 COUNTY RD 25A S',
'22 SANSTONE RIDGE WAY',
'55070 MENDOZA TRL',
'1609 HUNTSMERE AVE DOWN',
'243 MISTY WOODS CV S',
'2292 BAYBERRY CMNS',
'16 KILDEER CRK',
'40 BEDFORD XING',
'4 LEXINGTON SQ',
'113 SPARROWS CRST',
'1082 MATHOM LNDG',
'1050 WILLOW RIDGE LOOP',
'660 REDTOP LOOP',
'8 MOUNT ROYAL LOOP',
'805 SIERRA OVAL',
'3012 NANTUCKET ROW',
'6 WOODROW AVE',
'943 DARROW PARK DR',
'743 BELVEDERE TER',
'189 WINCHESTER RD',
'19 WHITE OAK TRCE',
'890 BLACKJACK RD EXT',
'767 N EXCALIBUR DR',
'109 VININGS FOREST LN SE',
'508 E 141ST ST',
'85 ROSE LN ST SW'),
output= c(
'BLACKBERRY',
'BEARDSLEY',
'HUNT CLUB',
'STATE ROUTE 103',
'STATE RTE 43',
'COUNTY RD 4710',
'E 250TH',
'BALLINGER',
'BALLINGER',
'BALLINGER' ,
'ORANGE SUN',
'S MORELAND',
'RUSSET WOOD',
'MORROCCO',
'S PKWY',
'COUNTY',
'SANSTONE RIDGE WAY',
'MENDOZA',
'HUNTSMERE',
'MISTY WOODS',
'BAYBERRY',
'KILDEER',
'BEDFORD',
'LEXINGTON',
'SPARROWS',
'MATHOM',
'WILLOW RIDGE',
'REDTOP',
'MOUNT ROYAL',
'SIERRA',
'NANTUCKET',
'WOODROW',
'DARROW PARK',
'BELVEDERE',
'WINCHESTER',
'WHITE OAK',
'BLACKJACK',
'N EXCALIBUR',
'VININGS FOREST',
'E 141ST',
'ROSE LN'))
Side-by-side these two arrays are displayed as follows.
505 BLACKBERRY : BLACKBERRY
135 BEARDSLEY ST : BEARDSLEY
15 HUNT CLUB DR : HUNT CLUB
1223 STATE ROUTE 103 : STATE ROUTE 103
455 STATE RTE 43 : STATE RTE 43
206 COUNTY RD 4710 : COUNTY RD 4710
17 E 250TH ST : E 250TH
158 BALLINGER AVE SE : BALLINGER
150 BALLINGER AVE S : BALLINGER
18 BALLINGER AVE T : BALLINGER
1272 ORANGE SUN TRL : ORANGE SUN
291 S MORELAND BLVD : S MORELAND
615 RUSSET WOOD LN : RUSSET WOOD
1165 MORROCCO CT : MORROCCO
1321 S PKWY DR : S PKWY
250 COUNTY RD 25A S : COUNTY
22 SANSTONE RIDGE WAY : SANSTONE RIDGE WAY
55070 MENDOZA TRL : MENDOZA
1609 HUNTSMERE AVE DOWN : HUNTSMERE
243 MISTY WOODS CV S : MISTY WOODS
2292 BAYBERRY CMNS : BAYBERRY
16 KILDEER CRK : KILDEER
40 BEDFORD XING : BEDFORD
4 LEXINGTON SQ : LEXINGTON
113 SPARROWS CRST : SPARROWS
1082 MATHOM LNDG : MATHOM
1050 WILLOW RIDGE LOOP : WILLOW RIDGE
660 REDTOP LOOP : REDTOP
8 MOUNT ROYAL LOOP : MOUNT ROYAL
805 SIERRA OVAL : SIERRA
3012 NANTUCKET ROW : NANTUCKET
6 WOODROW AVE : WOODROW
943 DARROW PARK DR : DARROW PARK
743 BELVEDERE TER : BELVEDERE
189 WINCHESTER RD : WINCHESTER
19 WHITE OAK TRCE : WHITE OAK
890 BLACKJACK RD EXT : BLACKJACK
767 N EXCALIBUR DR : N EXCALIBUR
109 VININGS FOREST LN SE : VININGS FOREST
508 E 141ST ST : E 141ST
85 ROSE LN ST SW : ROSE LN
Here is what I have tried:
I stupidly used Regex 101 and didn't realize it was not compatible with R. I got it mostly working here before I had that realization. https://regex101.com/r/UdK6pB/1
I then tried to get this to work in R in multiple ways including this lovely bit of code:
d$input <- str_extract(d$input, "(?:(?<=[0-9]{1,5}\\b)).*\\b[[:digit:]]+$|(^[0-9]{1,5}\\b)(.*)(?:(?=\\bAVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE$))")
I have also tried something like this
d$output <- str_remove(d$input,"^[[:digit:]]+\\b")
d$output <- str_remove(d$output, "\\b['AVE'|'ST'|'DR'|'RD'|'LN'|'TRL'|'BLVD'|'CT'|'PKWY'|'JCT'|'SQ'|'HWY'|'WAY'|'CV'|'CMNS'|'CRK'|'XING'|'CRST'|'LNDG'|'LOOP'|'OVAL'|'ROW'|'TER'|'TRCE'|'RTE']$")
and then this
d$output <- sub("^[[:digit:]]+[[:space:]]", '', d$input, perl = TRUE)
d$output <- sub("[[:space:]]+[AVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE]$", '', d$output, perl=TRUE)
I am at my wits end and hope someone is willing and able to help me. Thank you.
Not a concise, elegant one, but a solution nonetheless (85 ROSE LN ST SW
was a bit tricky).
Just basic regex
thou.
Take a look.
The suffix pattern helper:
library(tidyverse)
# Suffix pattern
suffix <- c(
"AVE", "BLVD", "CMNS", "CRK", "CRST",
"CT", "CV", "DR", "LN", "LNDG",
"LOOP", "OVAL", "RD", "ROW", "SQ",
"ST", "TER", "TRCE", "TRL", "XING")
suffix <- paste0("\\b", suffix, "\\b")
suffix <- str_flatten(suffix, "|")
The code:
new_d <- d %>%
rowid_to_column("id") %>%
mutate(
.by = id,
# number = str_extract(input, "^\\d+"), # if you want it
my_output = str_remove(input, "^\\d+\\s+")) %>%
separate_rows(my_output, sep = "\\s") %>%
mutate(
.by = id,
index = cumsum(if_else(
str_detect(my_output, suffix) & !last(str_detect(my_output, "^\\d+$")),
1, 0))) %>%
filter(.by = id, index == 0 | index < max(index)) %>%
summarise(
.by = -c(my_output, index),
my_output = str_flatten(my_output, " "))
The output:
> new_d
# A tibble: 41 × 4
id input output my_output
<int> <chr> <chr> <chr>
1 1 505 BLACKBERRY BLACKBERRY BLACKBERRY
2 2 135 BEARDSLEY ST BEARDSLEY BEARDSLEY
3 3 15 HUNT CLUB DR HUNT CLUB HUNT CLUB
4 4 1223 STATE ROUTE 103 STATE ROUTE 103 STATE ROUTE 103
5 5 455 STATE RTE 43 STATE RTE 43 STATE RTE 43
6 6 206 COUNTY RD 4710 COUNTY RD 4710 COUNTY RD 4710
7 7 17 E 250TH ST E 250TH E 250TH
8 8 158 BALLINGER AVE SE BALLINGER BALLINGER
9 9 150 BALLINGER AVE S BALLINGER BALLINGER
10 10 18 BALLINGER AVE T BALLINGER BALLINGER
11 11 1272 ORANGE SUN TRL ORANGE SUN ORANGE SUN
12 12 291 S MORELAND BLVD S MORELAND S MORELAND
13 13 615 RUSSET WOOD LN RUSSET WOOD RUSSET WOOD
14 14 1165 MORROCCO CT MORROCCO MORROCCO
15 15 1321 S PKWY DR S PKWY S PKWY
16 16 250 COUNTY RD 25A S COUNTY COUNTY
17 17 22 SANSTONE RIDGE WAY SANSTONE RIDGE WAY SANSTONE RIDGE WAY
18 18 55070 MENDOZA TRL MENDOZA MENDOZA
19 19 1609 HUNTSMERE AVE DOWN HUNTSMERE HUNTSMERE
20 20 243 MISTY WOODS CV S MISTY WOODS MISTY WOODS
21 21 2292 BAYBERRY CMNS BAYBERRY BAYBERRY
22 22 16 KILDEER CRK KILDEER KILDEER
23 23 40 BEDFORD XING BEDFORD BEDFORD
24 24 4 LEXINGTON SQ LEXINGTON LEXINGTON
25 25 113 SPARROWS CRST SPARROWS SPARROWS
26 26 1082 MATHOM LNDG MATHOM MATHOM
27 27 1050 WILLOW RIDGE LOOP WILLOW RIDGE WILLOW RIDGE
28 28 660 REDTOP LOOP REDTOP REDTOP
29 29 8 MOUNT ROYAL LOOP MOUNT ROYAL MOUNT ROYAL
30 30 805 SIERRA OVAL SIERRA SIERRA
31 31 3012 NANTUCKET ROW NANTUCKET NANTUCKET
32 32 6 WOODROW AVE WOODROW WOODROW
33 33 943 DARROW PARK DR DARROW PARK DARROW PARK
34 34 743 BELVEDERE TER BELVEDERE BELVEDERE
35 35 189 WINCHESTER RD WINCHESTER WINCHESTER
36 36 19 WHITE OAK TRCE WHITE OAK WHITE OAK
37 37 890 BLACKJACK RD EXT BLACKJACK BLACKJACK
38 38 767 N EXCALIBUR DR N EXCALIBUR N EXCALIBUR
39 39 109 VININGS FOREST LN SE VININGS FOREST VININGS FOREST
40 40 508 E 141ST ST E 141ST E 141ST
41 41 85 ROSE LN ST SW ROSE LN ROSE LN
That's it.
Edit. Almost forgot:
# toy data
d <- tibble::tribble(
~input, ~output,
"505 BLACKBERRY", "BLACKBERRY",
"135 BEARDSLEY ST", "BEARDSLEY",
"15 HUNT CLUB DR", "HUNT CLUB",
"1223 STATE ROUTE 103", "STATE ROUTE 103",
"455 STATE RTE 43", "STATE RTE 43",
"206 COUNTY RD 4710", "COUNTY RD 4710",
"17 E 250TH ST", "E 250TH",
"158 BALLINGER AVE SE", "BALLINGER",
"150 BALLINGER AVE S", "BALLINGER",
"18 BALLINGER AVE T", "BALLINGER",
"1272 ORANGE SUN TRL", "ORANGE SUN",
"291 S MORELAND BLVD", "S MORELAND",
"615 RUSSET WOOD LN", "RUSSET WOOD",
"1165 MORROCCO CT", "MORROCCO",
"1321 S PKWY DR", "S PKWY",
"250 COUNTY RD 25A S", "COUNTY",
"22 SANSTONE RIDGE WAY", "SANSTONE RIDGE WAY",
"55070 MENDOZA TRL", "MENDOZA",
"1609 HUNTSMERE AVE DOWN", "HUNTSMERE",
"243 MISTY WOODS CV S", "MISTY WOODS",
"2292 BAYBERRY CMNS", "BAYBERRY",
"16 KILDEER CRK", "KILDEER",
"40 BEDFORD XING", "BEDFORD",
"4 LEXINGTON SQ", "LEXINGTON",
"113 SPARROWS CRST", "SPARROWS",
"1082 MATHOM LNDG", "MATHOM",
"1050 WILLOW RIDGE LOOP", "WILLOW RIDGE",
"660 REDTOP LOOP", "REDTOP",
"8 MOUNT ROYAL LOOP", "MOUNT ROYAL",
"805 SIERRA OVAL", "SIERRA",
"3012 NANTUCKET ROW", "NANTUCKET",
"6 WOODROW AVE", "WOODROW",
"943 DARROW PARK DR", "DARROW PARK",
"743 BELVEDERE TER", "BELVEDERE",
"189 WINCHESTER RD", "WINCHESTER",
"19 WHITE OAK TRCE", "WHITE OAK",
"890 BLACKJACK RD EXT", "BLACKJACK",
"767 N EXCALIBUR DR", "N EXCALIBUR",
"109 VININGS FOREST LN SE", "VININGS FOREST",
"508 E 141ST ST", "E 141ST",
"85 ROSE LN ST SW", "ROSE LN"
)
Created on 2024-05-08 with reprex v2.1.0