Search code examples
rregexstringr

Extract Street Name from Address in R


I am trying to extract the street names from a street address. I just want to extract the street/house number at the beginning of the string, and extract everything after the last street suffix (RD, ST, DR, HWY, ect).

Two caveats:

  1. If there are two street suffixes, I want to leave the first one, and strip out the second one (For example, South Pkwy Dr becomes South Pkwy, not just South).

  2. If the street address ends in a number, I want to leave the suffix. For example, if the address is 123 County Rd 53, I want it to return County Rd 53. If the street address is State Rte 22, I want it to return State Rte 22.

Below is the sample data (input) and the desired output (output)

d <- tibble(input=c(
  '505 BLACKBERRY',
  '135 BEARDSLEY ST',
  '15 HUNT CLUB DR',
  '1223 STATE ROUTE 103',
  '455 STATE RTE 43',
  '206 COUNTY RD 4710',
  '17 E 250TH ST',
  '158 BALLINGER AVE SE',
  '150 BALLINGER AVE S',
  '18 BALLINGER AVE T' ,
  '1272 ORANGE SUN TRL',
  '291 S MORELAND BLVD',
  '615 RUSSET WOOD LN',
  '1165 MORROCCO CT',
  '1321 S PKWY DR',
  '250 COUNTY RD 25A S',
  '22 SANSTONE RIDGE WAY',
  '55070 MENDOZA TRL',
  '1609 HUNTSMERE AVE DOWN',
  '243 MISTY WOODS CV S',
  '2292 BAYBERRY CMNS',
  '16 KILDEER CRK',
  '40 BEDFORD XING',
  '4 LEXINGTON SQ',
  '113 SPARROWS CRST',
  '1082 MATHOM LNDG',
  '1050 WILLOW RIDGE LOOP',
  '660 REDTOP LOOP',
  '8 MOUNT ROYAL LOOP',
  '805 SIERRA OVAL',
  '3012 NANTUCKET ROW',
  '6 WOODROW AVE',
  '943 DARROW PARK DR',
  '743 BELVEDERE TER',
  '189 WINCHESTER RD',
  '19 WHITE OAK TRCE',
  '890 BLACKJACK RD EXT',
  '767 N EXCALIBUR DR',
  '109 VININGS FOREST LN SE',
  '508 E 141ST ST',
  '85 ROSE LN ST SW'),
  output= c(
  'BLACKBERRY',
  'BEARDSLEY',
  'HUNT CLUB',
  'STATE ROUTE 103',
  'STATE RTE 43',
  'COUNTY RD 4710',
  'E 250TH',
  'BALLINGER',
  'BALLINGER',
  'BALLINGER' ,
  'ORANGE SUN',
  'S MORELAND',
  'RUSSET WOOD',
  'MORROCCO',
  'S PKWY',
  'COUNTY',
  'SANSTONE RIDGE WAY',
  'MENDOZA',
  'HUNTSMERE',
  'MISTY WOODS',
  'BAYBERRY',
  'KILDEER',
  'BEDFORD',
  'LEXINGTON',
  'SPARROWS',
  'MATHOM',
  'WILLOW RIDGE',
  'REDTOP',
  'MOUNT ROYAL',
  'SIERRA',
  'NANTUCKET',
  'WOODROW',
  'DARROW PARK',
  'BELVEDERE',
  'WINCHESTER',
  'WHITE OAK',
  'BLACKJACK',
  'N EXCALIBUR',
  'VININGS FOREST',
  'E 141ST',
  'ROSE LN'))

Side-by-side these two arrays are displayed as follows.

505 BLACKBERRY            :  BLACKBERRY
135 BEARDSLEY ST          :  BEARDSLEY
15 HUNT CLUB DR           :  HUNT CLUB
1223 STATE ROUTE 103      :  STATE ROUTE 103
455 STATE RTE 43          :  STATE RTE 43
206 COUNTY RD 4710        :  COUNTY RD 4710
17 E 250TH ST             :  E 250TH
158 BALLINGER AVE SE      :  BALLINGER
150 BALLINGER AVE S       :  BALLINGER
18 BALLINGER AVE T        :  BALLINGER
1272 ORANGE SUN TRL       :  ORANGE SUN
291 S MORELAND BLVD       :  S MORELAND
615 RUSSET WOOD LN        :  RUSSET WOOD
1165 MORROCCO CT          :  MORROCCO
1321 S PKWY DR            :  S PKWY
250 COUNTY RD 25A S       :  COUNTY
22 SANSTONE RIDGE WAY     :  SANSTONE RIDGE WAY
55070 MENDOZA TRL         :  MENDOZA
1609 HUNTSMERE AVE DOWN   :  HUNTSMERE
243 MISTY WOODS CV S      :  MISTY WOODS
2292 BAYBERRY CMNS        :  BAYBERRY
16 KILDEER CRK            :  KILDEER
40 BEDFORD XING           :  BEDFORD
4 LEXINGTON SQ            :  LEXINGTON
113 SPARROWS CRST         :  SPARROWS
1082 MATHOM LNDG          :  MATHOM
1050 WILLOW RIDGE LOOP    :  WILLOW RIDGE
660 REDTOP LOOP           :  REDTOP
8 MOUNT ROYAL LOOP        :  MOUNT ROYAL
805 SIERRA OVAL           :  SIERRA
3012 NANTUCKET ROW        :  NANTUCKET
6 WOODROW AVE             :  WOODROW
943 DARROW PARK DR        :  DARROW PARK
743 BELVEDERE TER         :  BELVEDERE
189 WINCHESTER RD         :  WINCHESTER
19 WHITE OAK TRCE         :  WHITE OAK
890 BLACKJACK RD EXT      :  BLACKJACK
767 N EXCALIBUR DR        :  N EXCALIBUR
109 VININGS FOREST LN SE  :  VININGS FOREST
508 E 141ST ST            :  E 141ST
85 ROSE LN ST SW          :  ROSE LN

Here is what I have tried:

I stupidly used Regex 101 and didn't realize it was not compatible with R. I got it mostly working here before I had that realization. https://regex101.com/r/UdK6pB/1

I then tried to get this to work in R in multiple ways including this lovely bit of code:

d$input <- str_extract(d$input, "(?:(?<=[0-9]{1,5}\\b)).*\\b[[:digit:]]+$|(^[0-9]{1,5}\\b)(.*)(?:(?=\\bAVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE$))")

I have also tried something like this

d$output <- str_remove(d$input,"^[[:digit:]]+\\b")
d$output <- str_remove(d$output, "\\b['AVE'|'ST'|'DR'|'RD'|'LN'|'TRL'|'BLVD'|'CT'|'PKWY'|'JCT'|'SQ'|'HWY'|'WAY'|'CV'|'CMNS'|'CRK'|'XING'|'CRST'|'LNDG'|'LOOP'|'OVAL'|'ROW'|'TER'|'TRCE'|'RTE']$")

and then this

d$output <- sub("^[[:digit:]]+[[:space:]]", '', d$input, perl = TRUE)
d$output <- sub("[[:space:]]+[AVE|ST|DR|RD|LN|TRL|BLVD|CT|PKWY|JCT|SQ|HWY|WAY|CV|CMNS|CRK|XING|CRST|LNDG|LOOP|OVAL|ROW|TER|TRCE|RTE]$", '', d$output, perl=TRUE)

I am at my wits end and hope someone is willing and able to help me. Thank you.


Solution

  • Not a concise, elegant one, but a solution nonetheless (85 ROSE LN ST SW was a bit tricky).
    Just basic regex thou.
    Take a look.

    The suffix pattern helper:

    library(tidyverse)
    
    # Suffix pattern
    suffix <- c(
       "AVE", "BLVD", "CMNS", "CRK", "CRST", 
        "CT",   "CV",   "DR",  "LN", "LNDG", 
      "LOOP", "OVAL",   "RD", "ROW",   "SQ", 
        "ST",  "TER", "TRCE", "TRL", "XING")
    
    suffix <- paste0("\\b", suffix, "\\b")
    suffix <- str_flatten(suffix, "|")
    

    The code:

    new_d <- d %>% 
      rowid_to_column("id") %>% 
      mutate(
        .by = id, 
        # number = str_extract(input, "^\\d+"), # if you want it
        my_output = str_remove(input, "^\\d+\\s+")) %>% 
      
      separate_rows(my_output, sep = "\\s") %>% 
      
      mutate(
        .by = id, 
        index = cumsum(if_else(
          str_detect(my_output, suffix) & !last(str_detect(my_output, "^\\d+$")),
          1, 0))) %>% 
      
      filter(.by = id, index == 0 | index < max(index)) %>% 
    
      summarise(
        .by = -c(my_output, index),
        my_output = str_flatten(my_output, " "))
    

    The output:

    > new_d
    # A tibble: 41 × 4
          id input                    output             my_output         
       <int> <chr>                    <chr>              <chr>             
     1     1 505 BLACKBERRY           BLACKBERRY         BLACKBERRY        
     2     2 135 BEARDSLEY ST         BEARDSLEY          BEARDSLEY         
     3     3 15 HUNT CLUB DR          HUNT CLUB          HUNT CLUB         
     4     4 1223 STATE ROUTE 103     STATE ROUTE 103    STATE ROUTE 103   
     5     5 455 STATE RTE 43         STATE RTE 43       STATE RTE 43      
     6     6 206 COUNTY RD 4710       COUNTY RD 4710     COUNTY RD 4710    
     7     7 17 E 250TH ST            E 250TH            E 250TH           
     8     8 158 BALLINGER AVE SE     BALLINGER          BALLINGER         
     9     9 150 BALLINGER AVE S      BALLINGER          BALLINGER         
    10    10 18 BALLINGER AVE T       BALLINGER          BALLINGER         
    11    11 1272 ORANGE SUN TRL      ORANGE SUN         ORANGE SUN        
    12    12 291 S MORELAND BLVD      S MORELAND         S MORELAND        
    13    13 615 RUSSET WOOD LN       RUSSET WOOD        RUSSET WOOD       
    14    14 1165 MORROCCO CT         MORROCCO           MORROCCO          
    15    15 1321 S PKWY DR           S PKWY             S PKWY            
    16    16 250 COUNTY RD 25A S      COUNTY             COUNTY            
    17    17 22 SANSTONE RIDGE WAY    SANSTONE RIDGE WAY SANSTONE RIDGE WAY
    18    18 55070 MENDOZA TRL        MENDOZA            MENDOZA           
    19    19 1609 HUNTSMERE AVE DOWN  HUNTSMERE          HUNTSMERE         
    20    20 243 MISTY WOODS CV S     MISTY WOODS        MISTY WOODS       
    21    21 2292 BAYBERRY CMNS       BAYBERRY           BAYBERRY          
    22    22 16 KILDEER CRK           KILDEER            KILDEER           
    23    23 40 BEDFORD XING          BEDFORD            BEDFORD           
    24    24 4 LEXINGTON SQ           LEXINGTON          LEXINGTON         
    25    25 113 SPARROWS CRST        SPARROWS           SPARROWS          
    26    26 1082 MATHOM LNDG         MATHOM             MATHOM            
    27    27 1050 WILLOW RIDGE LOOP   WILLOW RIDGE       WILLOW RIDGE      
    28    28 660 REDTOP LOOP          REDTOP             REDTOP            
    29    29 8 MOUNT ROYAL LOOP       MOUNT ROYAL        MOUNT ROYAL       
    30    30 805 SIERRA OVAL          SIERRA             SIERRA            
    31    31 3012 NANTUCKET ROW       NANTUCKET          NANTUCKET         
    32    32 6 WOODROW AVE            WOODROW            WOODROW           
    33    33 943 DARROW PARK DR       DARROW PARK        DARROW PARK       
    34    34 743 BELVEDERE TER        BELVEDERE          BELVEDERE         
    35    35 189 WINCHESTER RD        WINCHESTER         WINCHESTER        
    36    36 19 WHITE OAK TRCE        WHITE OAK          WHITE OAK         
    37    37 890 BLACKJACK RD EXT     BLACKJACK          BLACKJACK         
    38    38 767 N EXCALIBUR DR       N EXCALIBUR        N EXCALIBUR       
    39    39 109 VININGS FOREST LN SE VININGS FOREST     VININGS FOREST    
    40    40 508 E 141ST ST           E 141ST            E 141ST           
    41    41 85 ROSE LN ST SW         ROSE LN            ROSE LN 
    

    That's it.


    Edit. Almost forgot:

    # toy data
    d <- tibble::tribble(
                          ~input,              ~output,
                "505 BLACKBERRY",         "BLACKBERRY",
              "135 BEARDSLEY ST",          "BEARDSLEY",
               "15 HUNT CLUB DR",          "HUNT CLUB",
          "1223 STATE ROUTE 103",    "STATE ROUTE 103",
              "455 STATE RTE 43",       "STATE RTE 43",
            "206 COUNTY RD 4710",     "COUNTY RD 4710",
                 "17 E 250TH ST",            "E 250TH",
          "158 BALLINGER AVE SE",          "BALLINGER",
           "150 BALLINGER AVE S",          "BALLINGER",
            "18 BALLINGER AVE T",          "BALLINGER",
           "1272 ORANGE SUN TRL",         "ORANGE SUN",
           "291 S MORELAND BLVD",         "S MORELAND",
            "615 RUSSET WOOD LN",        "RUSSET WOOD",
              "1165 MORROCCO CT",           "MORROCCO",
                "1321 S PKWY DR",             "S PKWY",
           "250 COUNTY RD 25A S",             "COUNTY",
         "22 SANSTONE RIDGE WAY", "SANSTONE RIDGE WAY",
             "55070 MENDOZA TRL",            "MENDOZA",
       "1609 HUNTSMERE AVE DOWN",          "HUNTSMERE",
          "243 MISTY WOODS CV S",        "MISTY WOODS",
            "2292 BAYBERRY CMNS",           "BAYBERRY",
                "16 KILDEER CRK",            "KILDEER",
               "40 BEDFORD XING",            "BEDFORD",
                "4 LEXINGTON SQ",          "LEXINGTON",
             "113 SPARROWS CRST",           "SPARROWS",
              "1082 MATHOM LNDG",             "MATHOM",
        "1050 WILLOW RIDGE LOOP",       "WILLOW RIDGE",
               "660 REDTOP LOOP",             "REDTOP",
            "8 MOUNT ROYAL LOOP",        "MOUNT ROYAL",
               "805 SIERRA OVAL",             "SIERRA",
            "3012 NANTUCKET ROW",          "NANTUCKET",
                 "6 WOODROW AVE",            "WOODROW",
            "943 DARROW PARK DR",        "DARROW PARK",
             "743 BELVEDERE TER",          "BELVEDERE",
             "189 WINCHESTER RD",         "WINCHESTER",
             "19 WHITE OAK TRCE",          "WHITE OAK",
          "890 BLACKJACK RD EXT",          "BLACKJACK",
            "767 N EXCALIBUR DR",        "N EXCALIBUR",
      "109 VININGS FOREST LN SE",     "VININGS FOREST",
                "508 E 141ST ST",            "E 141ST",
              "85 ROSE LN ST SW",            "ROSE LN"
      )
    

    Created on 2024-05-08 with reprex v2.1.0