Search code examples
rkmlgeospatial

How to filter a dataframe of geocoordinates using a KML polygon?


I have a CSV of data points that span the entire Earth (the US Geological Service's earthquake feed), and I want to filter for only earthquakes within the United States' border.

The KML file I've pulled from the U.S. Census Bureau:

https://www.census.gov/geo/maps-data/data/kml/kml_nation.html

In R, the rgdal library can load KML files:

library(rgdal)
kml = readOGR("kmls/cb_2014_us_nation_20m.kml", 'cb_2014_us_nation_20m')

How do I use dplyr/plyr/etc. to filter a data.frame (the columns for the geo data are latitude and longitude) for only the rows that fall within the boundaries specified by the KML?


Edit, post-answer:

Here's what I used from @hrbrmstr's answer to get a quick visualization:

library(sp)
library(rgdal)
# download earthquakes
url <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv"
fil <- "all_week.csv"
if (!file.exists(fil)) download.file(url, fil)
quakes <- read.csv("all_week.csv", stringsAsFactors=FALSE)
# create spatial object
sp::coordinates(quakes) <- ~longitude+latitude

# download nation KML
url <- "http://www2.census.gov/geo/tiger/GENZ2014/kml/cb_2014_us_nation_20m.zip"
fil <- "uskml.zip"
if (!file.exists(fil)) download.file(url, fil)
unzip(fil, exdir="uskml")
# read KML file
us <- rgdal::readOGR("./uskml/cb_2014_us_nation_20m.kml", "cb_2014_us_nation_20m")
sp::proj4string(quakes) <- sp::proj4string(us)

length(quakes)
# 1514

usquakes = quakes[us,]
length(usquakes)
# 1260

### visualize
plot(us) 
# plot all quakes
points(quakes$longitude, quakes$latitude)
# plot just US
points(usquakes$longitude, usquakes$latitude)

Resulting image:

US Geological Services detected earthquakes for one week; US quakes plotted in red

Thanks @hrbrmstr!


Solution

  • This'll do it a cpl ways:

    library(sp)
    library(maptools)
    library(rgeos) # not entirely necessary
    library(rgdal) # not entirely necessary
    
    url <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.csv"
    fil <- "all_week.csv"
    if (!file.exists(fil)) download.file(url, fil)
    
    quakes <- read.csv("all_week.csv", stringsAsFactors=FALSE)
    coordinates(quakes) <- ~longitude+latitude
    
    url <- "http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_nation_20m.zip"
    fil <- "us.zip"
    if (!file.exists(fil)) download.file(url, fil)
    unzip(fil, exdir="us")
    us <- readShapePoly("us/cb_2014_us_nation_20m.shp")
    
    # alternatively
    # us <- rgdal::readOGR("us/cbcb_2014_us_nation_20m.shp", "cb_2014_us_nation_20m")
    
    # TRUE if in
    in_us_rgeos <- rgeos::gIntersects(quakes, us, byid=TRUE)
    
    # <NA> if in
    in_us_over <- quakes %over% us
    

    gIntersects takes longer. rgdal and rgeos can be a bear to get working on some systems. YMMV

    Using the US KML will require (for the most part) rgdal:

    # you wanted KML shapefile tho
    url <- "http://www2.census.gov/geo/tiger/GENZ2014/kml/cb_2014_us_nation_20m.zip"
    fil <- "uskml.zip"
    if (!file.exists(fil)) download.file(url, fil)
    unzip(fil, exdir="uskml")
    
    us <- rgdal::readOGR("uskml/cb_2014_us_nation_20m.kml", "cb_2014_us_nation_20m")
    proj4string(quakes) <- proj4string(us)
    rgeos::gIntersects(quakes, us, byid=TRUE)
    head(quakes %over% us)