r openstreetmap ropensci

Return only non-missing values from osm search


I'm trying to get only non-NA values returned from osmdata. Take email addresses, for example: the query below returns mostly missing emails. How can I set up the query so that it returns only non-missing values? Setting value = "!null" didn't work either.

library(osmdata)

san <- opq(bbox = 'San Jose, California') %>%
  add_osm_feature(key = 'email') %>%
  osmdata_sf()
df <- san$osm_points
nrow(df)
sum(!is.na(df$email))

Solution

  • The osmdata package follows the same hierarchical structure as the OpenStreetMap data themselves. If you look into your data a bit further, you'll see the following:

    library (osmdata)
    #> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
    san <- opq(bbox = 'San Jose, California') %>%
        add_osm_feature(key = 'email') %>%
        osmdata_sf()
    df <- san$osm_points
    nrow(df)
    #> [1] 738
    sum(!is.na(df$email))
    #> [1] 39
    
    length (which (is.na (san$osm_multipolygons$email)))
    #> [1] 0
    length (which (is.na (san$osm_polygons$email)))
    #> [1] 3
    length (which (is.na (san$osm_points$email)))
    #> [1] 699
    

    The osm_multipolygons are the highest-level objects in the hierarchy, and all of them have email addresses. Each of those consists of numerous polygons, not all of which necessarily have email addresses, so there are 3 polygons with no email. By default, the points list, both in OSM itself and in osmdata, includes every single point that is part of any higher-level object, and so unavoidably includes very many points with no email address. There is therefore no way to issue a query that returns, in each category, only objects without missing values (see further information in repo issue #221).

    The result you desire can nevertheless be obtained via the unique_osmdata() function, which reduces each of the different types of objects down to only those unique values as requested in the original call. This should give you what you want:

    san_unique <- unique_osmdata (san)
    san_unique
    #> Object of class 'osmdata' with:
    #>                  $bbox : 37.124503,-122.045672,37.4692175,-121.589153
    #>         $overpass_call : The call submitted to the overpass API
    #>                  $meta : metadata including timestamp and version numbers
    #>            $osm_points : 'sf' Simple Features Collection with 39 points
    #>          $osm_polygons : 'sf' Simple Features Collection with 36 polygons
    #>        $osm_multilines : NULL
    #>     $osm_multipolygons : 'sf' Simple Features Collection with 1 multipolygons
    length (which (is.na (san_unique$osm_multipolygons$email))) 
    #> [1] 0
    length (which (is.na (san_unique$osm_polygons$email)))
    #> [1] 0
    length (which (is.na (san_unique$osm_points$email)))
    #> [1] 0
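
Because the Overpass query itself cannot exclude missing values, another option is to filter each component after download. The following is only a minimal sketch in base R: keep_non_missing() is a hypothetical helper, not part of osmdata, and the pts data frame is an invented stand-in for san$osm_points so that the example runs without a network request.

```r
# Hypothetical helper (not part of osmdata): keep only the rows of a
# data frame whose `key` column is not NA.
keep_non_missing <- function (x, key) {
    # A component may be NULL or lack the column entirely;
    # in either case return it unchanged.
    if (is.null (x) || !key %in% names (x)) {
        return (x)
    }
    x [!is.na (x [[key]]), , drop = FALSE]
}

# Toy stand-in for san$osm_points; the values are invented for illustration.
pts <- data.frame (
    osm_id = c ("1", "2", "3", "4"),
    email = c (NA, "info@example.org", NA, "shop@example.org"),
    stringsAsFactors = FALSE
)

pts_email <- keep_non_missing (pts, "email")
nrow (pts_email)
#> [1] 2
sum (is.na (pts_email$email))
#> [1] 0
```

On a real result, the same row subsetting should apply to the 'sf' components, for example keep_non_missing (san$osm_points, "email"). Note that this is not the same operation as unique_osmdata(): it filters directly on the column rather than removing points that belong to higher-level objects.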