Search code examples
rdplyrarcgis

Streamline reverse geocoding in R


I am using the reverse_geo() function to get the states to which a list of oil wells belong. The problem is that there are 80 thousand wells and the process becomes very slow. So, to speed up the search I would like to get only the subregion since it is the only data I need. How can I do that?

My code is:

library(readxl)
info_pozos <- read_excel("C:/Users/Kevin/Downloads/info_pozos.xlsx")


library(dplyr, warn.conflicts = FALSE)
library(tidygeocoder)

reverse <- info_pozos %>%
reverse_geocode(lat = coordenaday, long = coordenadax, method = 'arcgis',
address = address_found, full_results = FALSE)

View(reverse)

I expect to gain at least a couple of hours of time.


Solution

  • The bottleneck here is not the geographic granularity but rather the API requests to the geocoding service. Some geocoding services supported by tidygeocoder have batch processing capabilities, which should speed up the process somewhat.

    I’d recommend another strategy, though:

    1. Acquire shapefiles/geometry of the subregions/divisions.
    2. Perform a geographic join of well coordinates and division geometry.

    Assuming that the wells are in the US you can use the tigris package to obtain division geometries from the CENSUS Tiger Line service. The sf package can be used to perform a geographic join with st_join().

    For the example below, I downloaded HIFLD data for 1,506,228 oil and gas wells in the US. (That data already contains a state column. For the sake of the example, we ignore that column.) It takes ~15 seconds on my machine to perform the st_join() for all wells in the dataset.

    library(tidyverse)
    library(tigris)
    library(sf)
    options(tigris_use_cache = TRUE)
    
    # Obtain subregions/divisions with `tigris::divisions()` 
    # and rename NAME column to division, drop other non-geometry columns
    us_devisions <-
      divisions() |>
      select(division = NAME)
    
    # https://opendata.arcgis.com/api/v3/datasets/17c5ed5a6bd44dd0a52c616a5b0cacca_0/downloads/data?format=csv&spatialRefId=3857&where=1%3D1
    wells <-
      read_csv("Oil__and__Natural__Gas__Wells.csv")
    
    # drop some of the columns and turn well data into an `sf` object
    wells_sf <-
      wells |>
      select(ID, NAME, LONGITUDE, LATITUDE) |>
      st_as_sf(coords = c("LONGITUDE", "LATITUDE"),
               crs = st_crs(us_devisions))
    
    # perform geographic join
    system.time({
      res <-
        st_join(wells_sf, us_devisions, join = st_within)
    })
    #>    user  system elapsed 
    #>  15.013   0.336  15.371
    
    res
    #> Simple feature collection with 1506238 features and 3 fields
    #> Geometry type: POINT
    #> Dimension:     XY
    #> Bounding box:  xmin: -160.0238 ymin: 25.88125 xmax: -74.45766 ymax: 71.0696
    #> Geodetic CRS:  NAD83
    #> # A tibble: 1,506,238 × 4
    #>    ID         NAME                                       geometry division      
    #>  * <chr>      <chr>                                   <POINT [°]> <chr>         
    #>  1 W220000001 RRBB NAB PXY RA SU;J L SCALES  (-93.55748 32.06496) West South Ce…
    #>  2 W220000002 P SUHH;J B BARR 28             (-93.90817 32.08437) West South Ce…
    #>  3 W220000003 HOSS RB SUA;POLAND               (-92.74102 32.651) West South Ce…
    #>  4 W220000004 LODWICK LUMBER COMPANY         (-93.47957 32.60635) West South Ce…
    #>  5 W220000005 SL 1367                        (-90.17218 29.06883) <NA>          
    #>  6 W220000006 THE LOUISIANA FRUIT COMPANY    (-89.36671 29.23455) West South Ce…
    #>  7 W220000007 CONTINENTAL SECURITIES         (-93.48068 32.62225) West South Ce…
    #>  8 W220000008 LR CIB 32A 8 RA SU;SL 1450     (-90.34038 29.26782) <NA>          
    #>  9 W220000009 DL OPERC 2 RA SU;SCHWING L&S E (-91.32461 29.79531) West South Ce…
    #> 10 W220000010 HOSS A SUJJ;WOODARD-WALKER L   (-93.14482 32.48034) West South Ce…
    #> # ℹ 1,506,228 more rows