Search code examples
rdata.tablespatialr-sf

Method(s) to speed up st_crop (sf package) on large datasets


I need to extract information across different shapefiles for ~ 4 mio grid-cells of 1 ha. Currently I am using st_crop on each layer in a for-loop over all cells, but this runs forever. I thought to speed up the process in using a 'data.table'(DT)-sort-of-way to crop shapefiles by coordinates. Let's consider the example below, where I am looking for the extent of polygon edges in an area of interest:

require(sf)
require(data.table)
require(ggplot2)
require(tidyverse)

# load shapefile
nc = st_read(system.file("shape/nc.shp", package="sf"))


# Define a bounding-box that mimic a mowing-window or area of interest
bb <- st_bbox(c(xmin= -79, xmax=-78,ymin= 34.5, ymax= 35.5))


# Commute 'nc' into some sort of data.table object for fast subsetting, in preserving object's integrity (i.e. same id to all points of a given polygon)
nobs <- mapview::npts(nc,by_feature=T)
NC <- data.table::data.table(id=rep(1:nrow(nc),nobs),st_coordinates(nc)[,1:2])
head(NC)

# Compare cropping methods amon
library(microbenchmark)
x = runif(100)
test <- microbenchmark(
  crop_nc <- st_crop(nc,bb),
  crop_NC <- NC[X >= bb[1] & X < bb[3] & Y>= bb[2] & Y < bb[4]]
)  

print(test)
Unit: microseconds
  expr      min       lq      mean   median        uq       max neval cld
 crop_nc  5205.051 5675.807 6837.9472 5903.219 6829.0865 16046.654   100   b
 crop_NC   405.334  528.356  624.8398  576.996  656.9245  1295.361   100  a 
There were 50 or more warnings (use warnings() to see the first 50)

As expected, the DT-way of subsetting is faster. Let's now go from our DT-object back to as sf-object as follow:

crop_NC_sf <- st_as_sf(crop_NC,coords=c("X","Y"),crs=st_crs(nc))  %>% group_by(id)  %>%  summarise(i=mean(id)) %>% st_cast("POLYGON")

Now compare the perimet of polygon's included in our study area:

sum(st_length(crop_nc),na.rm=T)
1307555 [m]

sum(st_length(crop_NC_sf),na.rm=T)
2610959 [m]

Obviously not working very well...

Result

Questions:

  • is there another way to speed up st_crop()

  • is there a way to recreate a polygon from points in preserving the 'original' order points are connected to each others?


Solution

  • There is a little more to a proper intersection via st_intersects(or st_crop) than just subsetting a bunch of coordinates. Below you will find a working solution for your example that is trying to answer both of your questions.

    For question 1 I would suggest that identifying the polygons that intersect with the bbox before cropping them can in many cases speed up things quite a bit, especially if you have many polygons and the area of the bbox is small compared to the extent of the polygons area.

    Regarding question 2, you can use package sfheaders which provides methods to create sf objects from data.frames/data.tables etc.

    Finally, I reshaped the timing code a little to better reflect what each of the steps need to do to provide similar output, thus making the comparison a lot fairer.

    The final result of the data.table way of things is different from the st_crop as this approach only subsets coordinates to those inside the bbox, but does not insert the coordinates of the bbox at the appropriate position. This operation is likely costly, as we need to identify these correct positions. Therefore, you will get arbitrary shapes in the result, and thing may even break as you may likely produce invalid geometries.

    In general, I'd stick with the st_crop approach which will likely save you trouble later on. If your area of interest is big, has many polygons and the bbox you use is small in relation to the overall area, I expect that identifying the polygons that intersect with the bbox before doing the actual intersection will speed up things quite a bit.

    library(sf)
    #> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
    library(mapview)
    library(data.table)
    library(ggplot2)
    library(tidyverse)
    library(sfheaders)
    library(microbenchmark)
    
    # load shapefile
    nc = st_geometry(st_read(system.file("shape/nc.shp", package="sf")))
    #> Reading layer `nc' from data source `C:\Users\Tim\R\win-library\3.5\sf\shape\nc.shp' using driver `ESRI Shapefile'
    #> Simple feature collection with 100 features and 14 fields
    #> geometry type:  MULTIPOLYGON
    #> dimension:      XY
    #> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
    #> epsg (SRID):    4267
    #> proj4string:    +proj=longlat +datum=NAD27 +no_defs
    
    # Define a bounding-box that mimic a mowing-window or area of interest
    bb <- st_bbox(c(xmin= -79, xmax=-78,ymin= 34.5, ymax= 35.5))
    bb_sfc = st_as_sfc(bb)
    st_crs(bb_sfc) <- st_crs(nc)
    
    
    # Commute 'nc' into some sort of data.table object for fast subsetting, in preserving object's integrity (i.e. same id to all points of a given polygon)
    nobs <- mapview::npts(nc,by_feature=T)
    NC <- data.table::data.table(id=rep(1:length(nc),nobs),st_coordinates(nc)[,1:2])
    head(NC)
    #>    id         X        Y
    #> 1:  1 -81.47276 36.23436
    #> 2:  1 -81.54084 36.27251
    #> 3:  1 -81.56198 36.27359
    #> 4:  1 -81.63306 36.34069
    #> 5:  1 -81.74107 36.39178
    #> 6:  1 -81.69828 36.47178
    
    # Compare cropping methods amon
    test <- microbenchmark(
        sf_way = {
            # identify subset of nc that intersects with bbox
            isec = unlist(st_intersects(bb_sfc, nc))
            crop_nc <- st_crop(nc[isec], bb_sfc)
        }, 
        dt_way = {
            crop_NC <- NC[X >= bb[1] & X < bb[3] & Y>= bb[2] & Y < bb[4]]
            crop_NC_sf <- sfheaders::sf_polygon(crop_NC, "X", "Y", polygon_id = "id")
            st_crs(crop_NC_sf) = st_crs(nc)
        }
    )  
    
    print(test)
    #> Unit: milliseconds
    #>    expr      min       lq     mean   median       uq      max neval cld
    #>  sf_way 4.270801 4.492551 4.945424 4.828702 5.051951 8.666301   100   b
    #>  dt_way 3.101400 3.442551 3.705610 3.593301 3.887202 8.270801   100  a
    
    sum(st_length(crop_nc),na.rm=T)
    #> 1307555 [m]
    sum(st_length(crop_NC_sf),na.rm=T)
    #> 975793 [m]
    
    mapview(st_bbox(crop_nc)) + crop_NC_sf
    

    Created on 2019-11-29 by the reprex package (v0.3.0)