Drawing maps based on census data using ggplot2

I have a list of points that I want to overlay on a map of San Francisco using ggplot2. Each point is a longitude, latitude pair. I want the resulting map to be in a longitude/latitude coordinate system. I managed to reproduce Hadley Wickham's directions for plotting polygon shapefiles using his example file. I am using R 2.15.1 for Windows.

However, I tried to use cdp files downloaded from the UScensus2010cdp package. Here's my code snippet:

require("rgdal") 
require("maptools")
require("ggplot2")
require("sp")
require("plyr")
gpclibPermit() # required for fortify method
require(UScensus2010)
require(UScensus2010cdp)
data(california.cdp10)
sf <- city(name = "san francisco", state="ca")
sf.points = fortify(sf)

I get the following error:

Using name to define regions.
Error in unionSpatialPolygons(cp, invert(polys)) : input lengths differ
In addition: Warning message:
In split(as.numeric(row.names(attr)), addNA(attr[, region], TRUE)) :
   NAs introduced by coercion

Does anybody know:

What is a good value to give to the region parameter of fortify()?
If that fails, a source of map data with untransformed lat/long coordinates for San Francisco that ggplot2 can draw?
Alternatively, I found here another map of San Francisco, whose data is translated. Can you tell me how to either translate this data to raw lat/long or make the reverse translation for my set of points?

Solution

note:

unable to access UScensus2010cdp, so am using UScensus2000cpd which replicates the error.

The issue

The issue arises from the fact that fortify.SpatialPolygonsDataFrame relies on converting the row.names to numeric, and the rownames of your data are the identifiers.

ggplot2:::fortify.SpatialPolygonsDataFrame 

function (model, data, region = NULL, ...) 
{
    attr <- as.data.frame(model)
    if (is.null(region)) {
        region <- names(attr)[1]
        message("Using ", region, " to define regions.")
    }
    polys <- split(as.numeric(row.names(attr)), addNA(attr[, 
        region], TRUE))
    cp <- polygons(model)
    try_require(c("gpclib", "maptools"))
    unioned <- unionSpatialPolygons(cp, invert(polys))
    coords <- fortify(unioned)
    coords$order <- 1:nrow(coords)
    coords
}

In your case

row.names(sf@data)
## [1] "california_586" "california_590" "california_616"

are the identifiers you wish to use as the region parameters, as place state and name do not uniquely identify the three polygons.

# as.character used to coerce from factor
lapply(lapply(sf@data[,c('place','state','name')], unique), as.character)
## $place
## [1] "67000"
## 
## $state
## [1] "06"
## 
## $name
## [1] "San Francisco"

As a character vector where the elements begin with alphabetic characters, when coerced to numeric, it becomes NA

as.numeric(rownames(sf@data))
## [1] NA NA NA
## Warning message:
## NAs introduced by coercion

Which is one of the warnings given

Solution

Define a column to be the rownames
Set the row.names to NULL or 1:nrow(sf@data)

So..

# rownames
sf@data[['place_id']] <- rownames(sf@data)
row.names(sf@data) <- NULL

# fortify
sf_ggplot <- fortify(sf, region = 'place_id')
# merge to add the original data
sf_ggplot_all <- merge(sf_ggplot, sf@data, by.x = 'id', by.y = 'place_id')
# very basic and uninteresting plot
ggplot(sf_ggplot_all,aes(x=long,y=lat, group = group)) + 
  geom_polygon(aes(fill =pop2000)) + 
  coord_map()

enter image description here