Search code examples
sqlrr-sfgeopackage

Read only the rows of spatial data with specific column value using st_read and SQL


I have a spatial data set and I would like to read to memory only those rows that match a given value in a certain column. Here is a reproducible example with what I've tried.

Using a shape file, I can read the entire data set, but the code throws and error when I try to filter by column value. It looks like it does not recognize the column name

# read entire data set
shp_file <- system.file("shape/nc.shp", package="sf")

data <- read_sf(shp_file, query = "SELECT * FROM nc")

# select rows with specific column value
data <- read_sf(file, query = "SELECT * FROM nc WERE FIPSNO = 37009")

> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,
> :    Query execution failed, cannot open layer. In addition: Warning
> message: In CPL_read_ogr(dsn, layer, query, as.character(options),
> quiet,  :   GDAL Error 1: In ExecuteSQL(): sqlite3_prepare_v2(SELECT *
> FROM nc WERE FIPSNO = 37009):   near "FIPSNO": syntax error

Now, using a data set in saved in geopackage format, I cannot even read the entire data set to memory. In this case, it does not seem to recognize the table.

gpkg_file <- system.file("gpkg/nc.gpkg", package="sf")

data <- read_sf(gpkg_file, query = "SELECT * FROM nc")

> Error in CPL_read_ogr(dsn, layer, query, as.character(options), quiet,
> :    Query execution failed, cannot open layer. In addition: Warning
> message: In CPL_read_ogr(dsn, layer, query, as.character(options),
> quiet,  :   GDAL Error 1: In ExecuteSQL(): sqlite3_prepare_v2(SELECT *
> FROM nc):   no such table: nc


Solution

  • Your approach feels reasonable; it is somewhat difficult to troubleshoot it without having access to your actual dataset, but we can do with the world geopackage that ships with {spData} package.

    The my first step would be to check the names of layers stored in your geopackage; this will tell you what are legitimate tables to which you can point your WHERE clause.

    The second step is running your actual read operation; while sf::st_read() and sf::read_sf() are in principle the same thing - they differ only in defaults for some of the optional parameters - I suggest to use sf::st_read() when in doubt.

    In instances of troubleshooting it is also a good idea to assign your parameters explicitly, and not rely on position (dsn should be the first argument by position, but having it assigned by name is bulletproof).

    library(sf)
    
    # a GPKG dataset - it lives in spData package
    world_dataset <- system.file("shapes/world.gpkg", package = "spData")
    
    # check layers - there may be more than one!
    st_layers(world_dataset)
    
    # run the query; note the use of dsn and addressing a layer known to exist
    europe <- st_read(dsn = world_dataset,
                      query = "select * from world where continent = 'Europe'")
    
    # a visual check
    plot(st_geometry(europe))
    

    enter image description here

    While I believe that reasoneable people can differ in their opinion whether Russian Federation or French Guiana actually belong to Europe the SQL query on technical level clearly works.