Split a large list of addresses and feed batches into a geocoder


Say I have the following 20 addresses, and I want to split the list into 4 groups of 5 addresses each, and feed each group sequentially into a geocoder.

library(tidyverse)

df <- tibble::tribble(
              ~num_street,           ~city, ~state, ~zip_code,
        "976 FAIRVIEW DR",   "SPRINGFIELD",  "OR",    97477L,
          "19843 HWY 213",   "OREGON CITY",  "OR",    97045L,
            "402 CARL ST",         "DRAIN",  "OR",    97435L,
           "304 WATER ST",        "WESTON",  "OR",    97886L,
   "5054 TECHNOLOGY LOOP",     "CORVALLIS",  "OR",    97333L,
         "3401 YACHT AVE",  "LINCOLN CITY",  "OR",    97367L,
      "135 ROOSEVELT AVE",          "BEND",  "OR",    97702L,
         "3631 FENWAY ST",  "FOREST GROVE",  "OR",    97116L,
       "92250 HILLTOP LN",      "COQUILLE",  "OR",    97423L,
          "6920 92ND AVE",        "TIGARD",  "OR",    97223L,
          "591 LAUREL ST", "JUNCTION CITY",  "OR",    97448L,
   "32035 LYNX HOLLOW RD",      "CRESWELL",  "OR",    97426L,
          "6280 ASTER ST",   "SPRINGFIELD",  "OR",    97478L,
      "17533 VANGUARD LN",     "BEAVERTON",  "OR",    97007L,
      "59937 CHEYENNE RD",          "BEND",  "OR",    97702L,
          "2232 42ND AVE",         "SALEM",  "OR",    97317L,
         "3100 TURNER RD",         "SALEM",  "OR",    97302L,
       "3495 CHAMBERS ST",        "EUGENE",  "OR",    97405L,
          "585 WINTER ST",         "SALEM",  "OR",    97301L,
        "23985 VAUGHN RD",        "VENETA",  "OR",    97487L
  )

The code I'm using to geocode is:

library(censusxy)

system.time({
  dropme_dta <- 
    cxy_geocode(df, 
                street = 'num_street', 
                city = 'city', 
                state = 'state', 
                zip = 'zip_code', 
                return = 'geographies', 
                class = 'dataframe', 
                output = 'full', 
                parallel = 8, 
                vintage = 4,
                timeout = 30)
})

I am particularly interested in approaches that avoid loops and stay in the tidyverse. I think there may be a way using purrr::reduce(), but for the life of me I haven't been able to figure it out.

I'd be most grateful for any pointers!

P.S. I know that I can just pass all 20 addresses to the geocoder, but in practice I have about 4 million addresses, and I want to keep track of which batch it's on by printing out the batch number.

EDIT: based on feedback in comments, I agree that a loop is the best way forward. This is what I have so far:

library(tidygeocoder)

# Assign a 0-based batch id so that every batch holds exactly 5 rows
df <- df %>% 
  mutate(group_id = (row_number() - 1) %/% 5)

for (x in 0:max(df$group_id)) {
  cat("geocoding batch", x + 1, "of", max(df$group_id) + 1, "\n")
  Sys.sleep(1)
  df %>% 
    filter(group_id == x) %>%  # geocode only the current batch
    geocode(street = num_street, city = city, state = state, postalcode = zip_code, 
           method = "census", full_results = TRUE, api_options = list(census_return_type = 'geographies'))
}

But I don't know how to build up the result iteratively. If I assign the output of geocode() to a variable, it gets overwritten on each iteration.


Solution

  • Based on your latest edit, you can save the result of each batch in a list and then combine the results into a single tibble. Something like this:

    # Row indices of the current batch
    ind  <- 1:4 # 4 rows per batch (use 1:5 for batches of 5, as in the question)
    intL <- length(ind) # batch size
    nr   <- nrow(df)
    
    # List to allocate results
    l <- list()
    lind <- 1 
    
    # Loop
    continue <- TRUE
    while(continue){
      
      # Last batch: drop indices past the final row and stop after this pass
      if(nr %in% ind){
        ind <- ind[ind <= nr]
        continue <- FALSE
      }
      
      l[[lind]] <- df[ind,] %>% 
        geocode(street = num_street, city = city, state = state, postalcode = zip_code, 
                method = "census", full_results = TRUE, api_options = list(census_return_type = 'geographies'))
      
      lind <- lind + 1 
      ind <-  ind  + intL
      
    }
    
    # Join all results
    do.call(rbind, l)
    

    The list l holds the result of each batch; do.call(rbind, l) then stacks them into a single tibble.
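
    Since the question asked for a tidyverse flavour, the same collect-then-combine idea can be sketched with purrr. This is only a sketch: fake_geocode(), addresses and batch_size are hypothetical names, and fake_geocode() is an offline stand-in for the real geocode() call so the example runs without hitting the Census API.

    ```r
    library(dplyr)
    library(purrr)

    # Hypothetical stand-in for the real geocoder (assumption, not a real API):
    # it only appends empty coordinate columns so the sketch runs offline.
    fake_geocode <- function(batch) {
      mutate(batch, lat = NA_real_, long = NA_real_)
    }

    # Toy input: 20 rows, split into batches of 5
    addresses  <- tibble(id = 1:20)
    batch_size <- 5
    batches <- split(addresses, (seq_len(nrow(addresses)) - 1) %/% batch_size)

    # imap() hands each batch and its name ("0", "1", ...) to the function,
    # which makes it easy to print progress
    results <- imap(batches, function(batch, name) {
      message("geocoding batch ", as.integer(name) + 1, " of ", length(batches))
      fake_geocode(batch)
    })

    # Stack the per-batch tibbles into one
    geocoded <- bind_rows(results)
    ```

    To use it for real, replace fake_geocode(batch) with the geocode() call from the loop above. Note that purrr::reduce() is not needed here: mapping over the batches and binding once at the end avoids growing a data frame step by step.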

    Given the size of your dataset, you could run into memory issues before the loop finishes. In that case, save intermediate results to files: every n batches, write the results to a file, empty the list, and continue. All partial results can be combined at the end.
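
    One way the file-based variant could look, written in base R with one file per batch for simplicity. Again a sketch under assumptions: fake_geocode() is a hypothetical offline stand-in for geocode(), and out_dir and the batch_%04d.rds naming are illustrative choices, not anything the geocoding packages require.

    ```r
    # Hypothetical stand-in for the real geocoder so the sketch runs offline
    fake_geocode <- function(batch) {
      batch$lat <- NA_real_
      batch
    }

    # Toy input: 20 rows in batches of 5
    addresses <- data.frame(id = 1:20)
    batches <- split(addresses, (seq_len(nrow(addresses)) - 1) %/% 5)

    # Assumed output location; point this at a persistent folder in practice
    out_dir <- file.path(tempdir(), "geocode_batches")
    dir.create(out_dir, showWarnings = FALSE)

    for (i in seq_along(batches)) {
      out_file <- file.path(out_dir, sprintf("batch_%04d.rds", i))
      if (file.exists(out_file)) next  # batch already saved by an earlier run
      message("geocoding batch ", i, " of ", length(batches))
      saveRDS(fake_geocode(batches[[i]]), out_file)
    }

    # Read every saved batch back and stack them
    files <- sort(list.files(out_dir, pattern = "\\.rds$", full.names = TRUE))
    geocoded <- do.call(rbind, lapply(files, readRDS))
    ```

    Because finished batches are skipped, the loop can be restarted after a crash or timeout without repeating work, and only one batch is held in memory at a time.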

    Alternatively, you can preallocate a dummy data frame with the same number of rows and columns as the expected result and overwrite its rows after each iteration. This approach may be slower.

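    # Pseudocode: `out` is the preallocated result and `batches` a list of
    # row-index vectors (both placeholder names)
    for (ind in batches) {
      out[ind, ] <- geocode(df[ind, ], ...)
    }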