Search code examples
ropenstreetmaposm.pbf

Using oe_get() from osmextract package to translate and read only selection of keys/tags from a pbf into a gpkg


Using the answer from a previous question (Structure Query for osmextract ), I want to use the R package osmextract to get a subset of keys/tags (e.g. shop=supermarket) for the whole Europe.osm.pbf from Geofabrik.

I am wondering if it is possible to avoid oe_get() translating ALL features of a layer (e.g. point, line or multipolygon) into a gpkg in the first place and conversely just have it translate a subset of features into a gpkg (e.g. only amenity=school).

The previous response does this in two steps:

  1. Load the data and put each layer into a gpkg
  2. Select specific features from a layer within this gpkg with a query

This leads to the production of a possibly huge gpkg in the first step, from which then a selection is made. It would be a saving of processing time and file size if these two steps could be done in one. I am not used to working with databases or similar, thats why I don't know if this kind of procedure is even possible with pbf files or if the translation into a gpkg is necessary to be able to do a query of this sort in the first place.

Some example code of my try with a mid-sized pbf file (Austria):

## STEP 1: (Download and) convert data into gpkg. (NB skipping dl here because file already downloaded)

## Each layer (line, multipolygon, point) will be consecutively added to the gpkg 'geofabrik_austria-latest'

oe_get("Austria", layer = "lines", provider = "geofabrik", force_download = FALSE, extra_tags = c("maxspeed", "oneway"), download_only = TRUE)
# This took about 8 minutes with my laptop (pbf filesize 2.5% of Europe.osm.pbf)
oe_get("Austria", layer = "multipolygons", provider = "geofabrik", force_download = FALSE, download_only = TRUE)
# This took about 13 minutes with my laptop (pbf filesize 2.5% of Europe.osm.pbf)
oe_get("Austria", layer = "points", provider = "geofabrik", force_download = FALSE, extra_tags = c("amenity", "shop"), download_only = TRUE)
# This took about 2 minutes with my laptop (pbf filesize 2.5% of Europe.osm.pbf)

# Total processing time: about 23 mins (Europe.osm.pbf file is 40 times bigger and estimated to take 15.3 hrs)
# gpkg file size 5.9 times bigger than pbf filesize (geofabrik_europe-latest.gpkg filesize is estimated 158.4 GB)

## STEP 2: Manipulate gpkg: Select keys/features

austria_roads_lines_gpkg <- oe_get(
  place = "Austria", 
  layer = "lines", 
  query = "SELECT * FROM lines WHERE highway IN ('motorway', 'trunk', 'primary', 'secondary', 'tertiary', 'unclassified', 'residential', 'motorway_link', 'trunk_link', 'primary_link', 'secondary_link', 'tertiary_link', 'living_street', 'service')",
  quiet = TRUE
)
austria_shops_multipoly_gpkg <- oe_get(
  place = "Austria", 
  layer = "multipolygons",
  query = "SELECT * FROM multipolygons WHERE shop IN ('convenience', 'greengrocer', 'supermarket')",
  quiet = TRUE
)
austria_shops_points_gpkg <- oe_get(
  place = "Austria", 
  layer = "points",
  query = "SELECT * FROM points WHERE shop IN ('convenience', 'greengrocer', 'supermarket')",
  quiet = TRUE
)

Solution

  • I actually found a solution myself by applying an SQL-like query within a character vector with the options that will be passed to ogr2ogr during the vectortranslate process. Here is an example code for extracting marketplaces and shops from the point and multipolygon layers of the Europe.osm.pbf from Geofabrik:

    library(osmextract)
    library(usethis)
    
    #Set a persistent download directory
    usethis::edit_r_environ() #add this command to the .Renviron file: OSMEXT_DOWNLOAD_DIRECTORY=/path/to/osm/data #Restart R for changes to take effect
    oe_download_directory() #check download dir
    list.files(oe_download_directory(), pattern = "pbf|gpkg") #list existing files in the download dir
    ## CAUTION: osmextract only detects the filename if it is structured like: geofabrik_country* 
    
    ### For big extractions also set tempdir, because GDAL/ogr2ogr first writes extracts in huge tempfiles and then in gpkg
    ### The driver will use an internal SQLite database to resolve geometries. If that database remains under 100 MB it will reside in RAM. If it grows above, it will be written in a temporary file on disk. By default, this file will be written in the current directory, unless you define the CPL_TMPDIR configuration option. The 100 MB default threshold can be adjusted with the OSM_MAX_TMPFILE_SIZE configuration option (value in MB).
    
    Sys.setenv(CPL_TMPDIR="D:\\23-02-16_OSM_Geofabrik_Europe") #to keep gdal from writing in C:\Users\xxx\AppData\Local\Temp folder and rather put temp files in the download directory!
    
    ## POINT LAYER
    
    # SQL-like query
    point_amenities_shop_vectortranslate = c(
      "-nln", "OSM_20230216_Europe_shop_point", 
      "-t_srs", "EPSG:3035", 
      "-select", "osm_id, name, shop, amenity, access", 
      "-where", "amenity = 'marketplace' OR shop in ('convenience','greengrocer','butcher','bakery','supermarket')"
    )
    
    # oe_get saves a gpkg into the defined directory with the extracted features
    options(nwarnings = 10000)
    system.time({
      oe_get("Europe", layer = "points", provider = "geofabrik", force_download = FALSE, vectortranslate_options = point_amenities_shop_vectortranslate, extra_tags = c("shop", "amenity", "access"), download_only = TRUE, skip_vectortranslate = FALSE, never_skip_vectortranslate = TRUE, quiet = FALSE)
    }) # 535,787 features extracted in 637 seconds
    
    ## MULTIPOLYGON LAYER
    
    poly_amenities_shop_vectortranslate = c(
      "-nln", "OSM_20230216_Europe_shop_poly", 
      "-t_srs", "EPSG:3035", 
      "-select", "osm_id, name, shop, amenity, access", 
      "-where", "amenity = 'marketplace' OR shop in ('convenience','greengrocer','butcher','bakery','supermarket')"
    )
    
    options(nwarnings = 10000)
    system.time({
      oe_get("Europe", layer = "multipolygons", provider = "geofabrik", force_download = FALSE, vectortranslate_options = poly_amenities_shop_vectortranslate, extra_tags = c("shop", "amenity", "access"), download_only = TRUE, skip_vectortranslate = FALSE, never_skip_vectortranslate = TRUE, quiet = FALSE)
    }) # 172,846 features in  20,005 seconds 
        # 18 warnings: non closed ring detected