I am using the reverse_geo() function to get the states to which a list of oil wells belong. The problem is that there are 80 thousand wells and the process becomes very slow. So, to speed up the search I would like to get only the subregion since it is the only data I need. How can I do that?
My code is:
library(readxl)
info_pozos <- read_excel("C:/Users/Kevin/Downloads/info_pozos.xlsx")
library(dplyr, warn.conflicts = FALSE)
library(tidygeocoder)
reverse <- info_pozos %>%
reverse_geocode(lat = coordenaday, long = coordenadax, method = 'arcgis',
address = address_found, full_results = FALSE)
View(reverse)
I expect to gain at least a couple of hours of time.
The bottleneck here is not the geographic granularity but rather the
API requests to the geocoding service. Some geocoding services
supported by tidygeocoder
have batch processing capabilities, which
should speed up the process somewhat.
I’d recommend another strategy, though:
Assuming that the wells are in the US you can use the tigris
package
to obtain division geometries from the CENSUS Tiger Line service. The sf
package can be used to perform a geographic join with st_join()
.
For the example below, I downloaded HIFLD
data for 1,506,228 oil and gas wells in the US. (That data already contains
a state column. For the sake of the example, we ignore that column.)
It takes ~15 seconds on my machine to perform the st_join()
for
all wells in the dataset.
library(tidyverse)
library(tigris)
library(sf)
options(tigris_use_cache = TRUE)
# Obtain subregions/divisions with `tigris::divisions()`
# and rename NAME column to division, drop other non-geometry columns
us_devisions <-
divisions() |>
select(division = NAME)
# https://opendata.arcgis.com/api/v3/datasets/17c5ed5a6bd44dd0a52c616a5b0cacca_0/downloads/data?format=csv&spatialRefId=3857&where=1%3D1
wells <-
read_csv("Oil__and__Natural__Gas__Wells.csv")
# drop some of the columns and turn well data into an `sf` object
wells_sf <-
wells |>
select(ID, NAME, LONGITUDE, LATITUDE) |>
st_as_sf(coords = c("LONGITUDE", "LATITUDE"),
crs = st_crs(us_devisions))
# perform geographic join
system.time({
res <-
st_join(wells_sf, us_devisions, join = st_within)
})
#> user system elapsed
#> 15.013 0.336 15.371
res
#> Simple feature collection with 1506238 features and 3 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -160.0238 ymin: 25.88125 xmax: -74.45766 ymax: 71.0696
#> Geodetic CRS: NAD83
#> # A tibble: 1,506,238 × 4
#> ID NAME geometry division
#> * <chr> <chr> <POINT [°]> <chr>
#> 1 W220000001 RRBB NAB PXY RA SU;J L SCALES (-93.55748 32.06496) West South Ce…
#> 2 W220000002 P SUHH;J B BARR 28 (-93.90817 32.08437) West South Ce…
#> 3 W220000003 HOSS RB SUA;POLAND (-92.74102 32.651) West South Ce…
#> 4 W220000004 LODWICK LUMBER COMPANY (-93.47957 32.60635) West South Ce…
#> 5 W220000005 SL 1367 (-90.17218 29.06883) <NA>
#> 6 W220000006 THE LOUISIANA FRUIT COMPANY (-89.36671 29.23455) West South Ce…
#> 7 W220000007 CONTINENTAL SECURITIES (-93.48068 32.62225) West South Ce…
#> 8 W220000008 LR CIB 32A 8 RA SU;SL 1450 (-90.34038 29.26782) <NA>
#> 9 W220000009 DL OPERC 2 RA SU;SCHWING L&S E (-91.32461 29.79531) West South Ce…
#> 10 W220000010 HOSS A SUJJ;WOODARD-WALKER L (-93.14482 32.48034) West South Ce…
#> # ℹ 1,506,228 more rows