I have hundreds of addresses in a data frame variable and need to extract zip codes from them. Some of the addresses contain more than one city, each with a zip code. Here is a mock example of a data frame and R code that extracts the zip codes.
require(qdapRegex)
require(stringr)
df <- data.frame(address = c("Walnut; 94596, Ontario, 91761, Beach, CA 90071", "Irvine Cal 92164"), var2 = "text")
df$zip.Rinker <- sapply(df$address, FUN = rm_zip, extract=TRUE)
The rm_zip
function from Tyler Rinker's qdapRegex
package extracts all the zip codes and puts them in list if there is more than one.
> df
address var2 zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596, 91761, 90071
2 Irvine Cal 92164 text 92164
How can R create a new row for each zip code that is in row 1 under zip.Rinker? Something like the following would be ideal. Note, there will be dozens of addresses that have multiple zip codes, so I am hoping for a solution that doesn't require manual steps.
address var2 zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596
2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 91761
3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 90071
4 Irvine Cal 92164 text 92164
Thank you for time.
PS Using stringr
, this code extracts zip codes and presents the same challenge.
df$zip.stringr <- str_extract_all(string = df$address, pattern = "\\d{5}")
You could do:
data.frame(rep(df$address, sapply(df$zip.Rinker, length)), unlist(df$zip.Rinker)
## rep.df.address..sapply.df.zip.Rinker..length.. unlist.df.zip.Rinker.
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
## 4 Irvine Cal 92164 92164
But note that rm_zip
is already vectorized and pretty speedy as it wraps the stringi
package. So no need for sapply
. Here's an approach that makes the code much more condensed using qdapTools
's list2df
that takes a named list
of vectors and turns them into a data.frame
.
library(qdapTools)
list2df(setNames(rm_zip(df$address, extract=TRUE), df$address), "zip", "address")[, 2:1]
## address zip
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
## 4 Irvine Cal 92164 92164
And I like the magrittr
framework for nested functions so here's that:
library(qdapTools)
library(magrittr)
df$address %>%
rm_zip(extract=TRUE) %>%
setNames(df$address) %>%
list2df("zip", "address") %>%
`[`(, 2:1)