I need to extract information across different shapefiles for ~ 4 mio grid-cells of 1 ha. Currently I am using st_crop on each layer in a for-loop over all cells, but this runs forever. I thought to speed up the process in using a 'data.table'(DT)-sort-of-way to crop shapefiles by coordinates. Let's consider the example below, where I am looking for the extent of polygon edges in an area of interest:
require(sf)
require(data.table)
require(ggplot2)
require(tidyverse)
# load shapefile
nc = st_read(system.file("shape/nc.shp", package="sf"))
# Define a bounding-box that mimic a mowing-window or area of interest
bb <- st_bbox(c(xmin= -79, xmax=-78,ymin= 34.5, ymax= 35.5))
# Commute 'nc' into some sort of data.table object for fast subsetting, in preserving object's integrity (i.e. same id to all points of a given polygon)
nobs <- mapview::npts(nc,by_feature=T)
NC <- data.table::data.table(id=rep(1:nrow(nc),nobs),st_coordinates(nc)[,1:2])
head(NC)
# Compare cropping methods amon
library(microbenchmark)
x = runif(100)
test <- microbenchmark(
crop_nc <- st_crop(nc,bb),
crop_NC <- NC[X >= bb[1] & X < bb[3] & Y>= bb[2] & Y < bb[4]]
)
print(test)
Unit: microseconds
expr min lq mean median uq max neval cld
crop_nc 5205.051 5675.807 6837.9472 5903.219 6829.0865 16046.654 100 b
crop_NC 405.334 528.356 624.8398 576.996 656.9245 1295.361 100 a
There were 50 or more warnings (use warnings() to see the first 50)
As expected, the DT-way of subsetting is faster. Let's now go from our DT-object back to as sf-object as follow:
crop_NC_sf <- st_as_sf(crop_NC,coords=c("X","Y"),crs=st_crs(nc)) %>% group_by(id) %>% summarise(i=mean(id)) %>% st_cast("POLYGON")
Now compare the perimet of polygon's included in our study area:
sum(st_length(crop_nc),na.rm=T)
1307555 [m]
sum(st_length(crop_NC_sf),na.rm=T)
2610959 [m]
Obviously not working very well...
Questions:
is there another way to speed up st_crop()
is there a way to recreate a polygon from points in preserving the 'original' order points are connected to each others?
There is a little more to a proper intersection via st_intersects
(or st_crop
) than just subsetting a bunch of coordinates. Below you will find a working solution for your example that is trying to answer both of your questions.
For question 1 I would suggest that identifying the polygons that intersect with the bbox before cropping them can in many cases speed up things quite a bit, especially if you have many polygons and the area of the bbox is small compared to the extent of the polygons area.
Regarding question 2, you can use package sfheaders which provides methods to create sf objects from data.frames/data.tables etc.
Finally, I reshaped the timing code a little to better reflect what each of the steps need to do to provide similar output, thus making the comparison a lot fairer.
The final result of the data.table
way of things is different from the st_crop
as this approach only subsets coordinates to those inside the bbox, but does not insert the coordinates of the bbox at the appropriate position. This operation is likely costly, as we need to identify these correct positions. Therefore, you will get arbitrary shapes in the result, and thing may even break as you may likely produce invalid geometries.
In general, I'd stick with the st_crop
approach which will likely save you trouble later on. If your area of interest is big, has many polygons and the bbox you use is small in relation to the overall area, I expect that identifying the polygons that intersect with the bbox before doing the actual intersection will speed up things quite a bit.
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(mapview)
library(data.table)
library(ggplot2)
library(tidyverse)
library(sfheaders)
library(microbenchmark)
# load shapefile
nc = st_geometry(st_read(system.file("shape/nc.shp", package="sf")))
#> Reading layer `nc' from data source `C:\Users\Tim\R\win-library\3.5\sf\shape\nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
# Define a bounding-box that mimic a mowing-window or area of interest
bb <- st_bbox(c(xmin= -79, xmax=-78,ymin= 34.5, ymax= 35.5))
bb_sfc = st_as_sfc(bb)
st_crs(bb_sfc) <- st_crs(nc)
# Commute 'nc' into some sort of data.table object for fast subsetting, in preserving object's integrity (i.e. same id to all points of a given polygon)
nobs <- mapview::npts(nc,by_feature=T)
NC <- data.table::data.table(id=rep(1:length(nc),nobs),st_coordinates(nc)[,1:2])
head(NC)
#> id X Y
#> 1: 1 -81.47276 36.23436
#> 2: 1 -81.54084 36.27251
#> 3: 1 -81.56198 36.27359
#> 4: 1 -81.63306 36.34069
#> 5: 1 -81.74107 36.39178
#> 6: 1 -81.69828 36.47178
# Compare cropping methods amon
test <- microbenchmark(
sf_way = {
# identify subset of nc that intersects with bbox
isec = unlist(st_intersects(bb_sfc, nc))
crop_nc <- st_crop(nc[isec], bb_sfc)
},
dt_way = {
crop_NC <- NC[X >= bb[1] & X < bb[3] & Y>= bb[2] & Y < bb[4]]
crop_NC_sf <- sfheaders::sf_polygon(crop_NC, "X", "Y", polygon_id = "id")
st_crs(crop_NC_sf) = st_crs(nc)
}
)
print(test)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> sf_way 4.270801 4.492551 4.945424 4.828702 5.051951 8.666301 100 b
#> dt_way 3.101400 3.442551 3.705610 3.593301 3.887202 8.270801 100 a
sum(st_length(crop_nc),na.rm=T)
#> 1307555 [m]
sum(st_length(crop_NC_sf),na.rm=T)
#> 975793 [m]
mapview(st_bbox(crop_nc)) + crop_NC_sf
Created on 2019-11-29 by the reprex package (v0.3.0)