Search code examples
rstreet-address

Standardization of Address Parts


I've got a large data frame in R consisting millions of unique addresses in a column (like 725 E 8TH ST).

In an effort to standardize addresses, I want to replace all the short street suffix abbreviations to their long form (changing E to East and ST to Street).

I am not looking to validate the addresses.

I found a package called campfin that has a tibble of many predefined abbreviations and their long forms (see usps_street).

I've tried to use the normal_address and the expand_abbrev functions to do this, but have been unsuccessful. Specifically, I tried normal_address(df$Address, abbs = usps_street) and expand_abbrev(df$Address, abb = usps_street). However, these are not replacing the abbreviations and the Address column is remaining unchanged.

Does anyone know how to properly use these functions? Or are there any suggestions for replacing these things? I don't want to defined short-forms (like abb = c("ST" = "STREET") ) because there are too many of them. Thus, I thought usps_street would be useful.

Here is a snippet of some of the address I'm looking at:

structure(list(Address = c("2771 NORTH THOMPSON RD NE", "10 BROOKWAY COURT", 
"1000 WESTCLIFF AVE", "10009 NORTHEAST 145TH ST", "10021 PINECREST DRIVE", 
"1003 WEST VICTORIA STREET"), Unit = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    City = c("ATLANTA", "MANSFIELD", "SAGINAW", "JONES", "PROVIDENCE VILLAGE", 
    "BRADY"), Country = c(NA, "US", "US", "US", "US", "US"), 
    Zip = c(30319, 76063, 76179, 73049, 76227, 76825), State = c("GA", 
    "TX", "TX", "OK", "TX", "TX")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

Solution

  • It looks like you're trying to do the inverse operation that the usps_street data frame was created for. For example, look at the abbreviations for "ST":

    library(dplyr)
    library(campfin)
    filter(usps_street, abb == "ST")
    
    # # A tibble: 3 × 2
    #   full   abb  
    #   <chr>  <chr>
    # 1 STR    ST   
    # 2 STREET ST   
    # 3 STRT   ST 
    

    This is designed to normalise in the following ways:

    STR -> ST
    STREET -> ST
    STRT -> ST
    

    However, you want to go the other way, meaning you have to select which of the longer values to map to.

    I am going to assume that you want the longest value, so STREET in this case. We need to create a data frame where the mapping is:

    ST -> STREET
    STR -> STREET
    STRT -> STREET
    

    We can do this fairly straightforwardly:

    abbs <- usps_street |>
        mutate(
            num_char = nchar(full)
        ) |>
        group_by(abb) |>
        mutate(
            abb = if_else(num_char == max(num_char), abb, full),
            # take the first longest if it's a tie
            full = full[ num_char == max(num_char)][1]
        )
    

    For STREET this looks like:

    abbs  |>
        filter(full == "STREET")
    
      full   abb   num_char
      <chr>  <chr>    <int>
    1 STREET STR          3
    2 STREET ST           6
    3 STREET STRT         4
    

    You can then use the campfin::expand_abbrev() function:

    dat |>
        mutate(
            address_expanded = expand_abbrev(
                Address, 
                abb = setNames(abbs$full, abbs$abb)
            )
        )
    
    # A tibble: 6 × 7
      Address                   Unit  City               Country   Zip State address_expanded                  
      <chr>                     <chr> <chr>              <chr>   <dbl> <chr> <chr>                             
    1 2771 NORTH THOMPSON RD NE NA    ATLANTA            NA      30319 GA    2771 NORTH THOMPSON ROAD NORTHEAST
    2 10 BROOKWAY COURT         NA    MANSFIELD          US      76063 TX    10 BROOKWAY COURT                 
    3 1000 WESTCLIFF AVE        NA    SAGINAW            US      76179 TX    1000 WESTCLIFF AVENUE             
    4 10009 NORTHEAST 145TH ST  NA    JONES              US      73049 OK    10009 NORTHEAST 145TH STREET      
    5 10021 PINECREST DRIVE     NA    PROVIDENCE VILLAGE US      76227 TX    10021 PINECREST DRIVE             
    6 1003 WEST VICTORIA STREET NA    BRADY              US      76825 TX    1003 WEST VICTORIA STREET