Remove a portion of a randomized string over an entire dataframe column in R


I need help removing arbitrary text that appears before the street address in some strings (the data set has ~5000 observations). The column test2$address contains values like these:

addresses <- c(
  "140 National Plz Oxon Hill, MD 20745",
  "6324 Windsor Mill Rd Gwynn Oak, MD 21207",
  "23030 Indian Creek Dr Sterling, VA 20166",
  "Located in Reston Town Center 18882 Explorer St Reston, VA 20190"
)

I want every address returned in a common format that starts at the street number:

[885] "23030 Indian Creek Dr Sterling, VA 20166" 
[886] "18882 Explorer St Reston, VA 20190"

I'm not sure how to go about this, as there is no specific pattern to the text that comes before the street number.


Solution

  • If you know that the address portion you want always starts with digits, and the part you want to remove never contains digits, then you can use this:

    sub(".*?(\\d+)", "\\1", x)
    

    Output:

    [1] "140 National Plz Oxon Hill, MD 20745"    
    [2] "6324 Windsor Mill Rd Gwynn Oak, MD 21207"
    [3] "23030 Indian Creek Dr Sterling, VA 20166"
    [4] "18882 Explorer St Reston, VA 20190"
    

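    This removes everything before the first run of digits. The lazy quantifier (.*?) matches as few characters as possible, so it stops as soon as the digit series (\\d+) can match; the backreference \\1 then puts that captured series back, so only the text in front of it is dropped. Strings that already start with digits are left unchanged.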

    Sample data:

    x <- c("140 National Plz Oxon Hill, MD 20745",
           "6324 Windsor Mill Rd Gwynn Oak, MD 21207",
           "23030 Indian Creek Dr Sterling, VA 20166",
           "Located in Reston Town Center 18882 Explorer St Reston, VA 20190")