
Choropleth Maps in R - TIGER Shapefile issue


I have a question on mapping with R, specifically about choropleth maps.

I have a dataset of ZIP codes, each assigned to an area, along with some associated data (dataset is here).

My final data format is: Area ID, ZIP, Probability Value, Customer Count, Area Probability and Area Customer Total. I am attempting to present this data by plotting Area Probability and Area Customer Total on a map. I have tried to do this using the Census TIGER shapefiles, but I guess R cannot handle the complete country.

I am comfortable with R's statistical capabilities, and I am now moving all my mapping from third-party GIS-focused applications into R. Does anyone have any pointers on how to achieve this from within R?

To be a little more detailed, here's the point where R stops working -

shapes <- readShapeSpatial("tl_2013_us_zcta510.shp")

(where the .shp file is the Census/TIGER shapefile).

Edit - Providing further details. I am trying to first read the TIGER shapefile, then combine this spatial dataset with my data, and eventually plot the result. I run into a problem at the very beginning, when attempting to read the shapefile. Below is the code with its output:

require(maptools)
shapes<-readShapeSpatial("tl_2013_us_zcta510.shp")

Error: cannot allocate vector of size 317 Kb

Solution

  • There are several examples and tutorials on making maps using R, but most are very general and, unfortunately, most map projects have nuances that create inscrutable problems. Yours is a case in point.

    The biggest issue I came across was that the US Census Bureau ZIP Code Tabulation Area (ZCTA) shapefile for the whole US is huge: ~800MB. When loaded using readOGR(...), the R SpatialPolygonsDataFrame object is about 913MB. Trying to process a file this size (e.g., converting it to a data frame using fortify(...)) resulted, at least on my system, in errors like the one you identified above. So the solution is to subset the file based on the zip codes that are actually in your data.
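
    As a minimal sketch of the subsetting idea, the snippet below uses small made-up data frames in place of the real shapefile attributes and Sample.csv. The names zcta, sample_data, and zcta_subset and all values are hypothetical; only the column names ZCTA5CE10, ZIP, and Probability_1 come from the code in this answer. It also shows one base-R way (sprintf) to restore leading zeros, as an alternative to str_pad.

```r
# Toy stand-in for the attribute table of the ZCTA shapefile (hypothetical values)
zcta <- data.frame(ZCTA5CE10 = c("07030", "10001", "90210", "60601"),
                   stringsAsFactors = FALSE)

# Toy stand-in for Sample.csv: ZIPs read as numbers lose their leading zero
sample_data <- data.frame(ZIP = c(7030L, 10001L),
                          Probability_1 = c(0.2, 0.7))

# Restore the 5-character form so the values can match ZCTA5CE10
sample_data$ZIP <- sprintf("%05d", sample_data$ZIP)

# Keep only the rows (polygons, in the real Spatial object) whose ZIP is in the data
keep <- zcta$ZCTA5CE10 %in% sample_data$ZIP
zcta_subset <- zcta[keep, , drop = FALSE]
```

    The same logical index works on the Spatial object itself, which is what zips[zips$ZCTA5CE10 %in% data$ZIP,] does in the full code.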

    The map (image not reproduced here) was made from your data using the following code.

    library(rgdal)
    library(ggplot2)
    library(stringr)
    library(RColorBrewer)
    
    setwd("<directory containing shapefiles and sample data>")
    
    data     <- read.csv("Sample.csv",header=T) # your sample data, downloaded as csv
    data$ZIP <- str_pad(data$ZIP,5,"left","0") # convert ZIP to char(5) w/leading zeros
    
    zips     <- readOGR(dsn=".","tl_2013_us_zcta510") # import zip code polygon shapefile
    map      <- zips[zips$ZCTA5CE10 %in% data$ZIP,]   # extract only zips in your Sample.csv
    map.df   <- fortify(map)        # convert to data frame suitable for plotting
    # merge data from Sample.csv into map data frame
    map.data <- data.frame(id=rownames(map@data),ZIP=map@data$ZCTA5CE10)
    map.data <- merge(map.data,data,by="ZIP")
    map.df   <- merge(map.df,map.data,by="id")
    # load state boundaries
    states <- readOGR(dsn=".","gz_2010_us_040_00_5m")
    states <- states[states$NAME %in% c("New York","New Jersey"),] # extract NY and NJ
    states.df <- fortify(states)    # convert to data frame suitable for plotting
    
    ggMap <- ggplot(data = map.df, aes(long, lat, group = group)) 
    ggMap <- ggMap + geom_polygon(aes(fill = Probability_1))
    ggMap <- ggMap + geom_path(data=states.df, aes(x=long,y=lat,group=group))
    ggMap <- ggMap + scale_fill_gradientn(name="Probability",colours=brewer.pal(9,"Reds"))
    ggMap <- ggMap + coord_equal()
    ggMap
    

    Explanation:

    The rgdal package facilitates the creation of R Spatial objects from ESRI shapefiles. In your case we are importing a polygon shapefile into a SpatialPolygonsDataFrame object in R. The latter has two main parts: a polygon section, which contains the latitude and longitude points that will be joined to create the polygons on the map, and a data section, which contains information about the polygons (so, one row for each polygon). If, e.g., we call the Spatial object map, then the two sections can be referenced as map@polygons and map@data. The basic challenge in making choropleth maps is to associate data from your Sample.csv file with the relevant polygons (zip codes).
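
    To make the two sections concrete, here is a small sketch that builds a toy SpatialPolygonsDataFrame by hand with the sp package, assuming sp is installed. The helper sq(), the two unit squares, and the ZIP values are hypothetical stand-ins for real ZCTA polygons.

```r
library(sp)

# Hypothetical helper: a unit square starting at x0, wrapped as one Polygons
# object. Vertices are listed clockwise, which sp treats as a normal ring
# rather than a hole.
sq <- function(x0, id) {
  coords <- cbind(c(x0, x0, x0 + 1, x0 + 1, x0),
                  c(0,  1,  1,      0,      0))
  Polygons(list(Polygon(coords)), ID = id)
}
polys <- SpatialPolygons(list(sq(0, "0"), sq(2, "1")))

# One attribute row per polygon; row names must match the polygon IDs
attrs <- data.frame(ZCTA5CE10 = c("07030", "10001"),
                    row.names = c("0", "1"),
                    stringsAsFactors = FALSE)
map <- SpatialPolygonsDataFrame(polys, attrs)

length(map@polygons)   # polygon section: one entry per polygon
rownames(map@data)     # data section: row names are the polygon IDs
map@data$ZCTA5CE10     # data section: the ZIP code column
```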

    So the basic workflow is as follows:

    1. Load polygon shapefiles into Spatial object ( => zips)
    2. Subset if appropriate ( => map).
    3. Convert to data frame suitable for plotting ( => map.df).
    4. Merge data from Sample.csv into map.df.
    5. Draw the map.
    

    Step 4 is the one that causes all the problems. First we have to associate zip codes with each polygon. Then we have to associate Probability_1 with each zip code. This is a three-step process.

    Each polygon in the Spatial data file has a unique ID, but these IDs are not the zip codes. The polygon IDs are stored as row names in map@data. The zip codes are stored in map@data, in column ZCTA5CE10. So first we must create a data frame that associates the map@data row names (id) with map@data$ZCTA5CE10 (ZIP). Then we merge your Sample.csv with the result using the ZIP field in both data frames. Then we merge the result of that into map.df. This can be done in 3 lines of code.
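
    The three merges can be tried out on toy data frames without loading any shapefile. In the sketch below, map_attr stands in for map@data and map.df for the output of fortify(...); both, along with all values and the name merged, are made up for illustration.

```r
# Hypothetical attribute table: row names are polygon IDs, ZCTA5CE10 holds ZIPs
map_attr <- data.frame(ZCTA5CE10 = c("07030", "10001"),
                       row.names = c("0", "1"),
                       stringsAsFactors = FALSE)

# Hypothetical fortify()-style output: one row per polygon vertex
map.df <- data.frame(long = c(0, 0, 1, 2, 2, 3),
                     lat  = c(0, 1, 1, 0, 1, 1),
                     id   = c("0", "0", "0", "1", "1", "1"),
                     stringsAsFactors = FALSE)

# Hypothetical Sample.csv contents
data <- data.frame(ZIP = c("07030", "10001"),
                   Probability_1 = c(0.2, 0.7),
                   stringsAsFactors = FALSE)

# 1. polygon id -> ZIP
map.data <- data.frame(id = rownames(map_attr), ZIP = map_attr$ZCTA5CE10,
                       stringsAsFactors = FALSE)
# 2. ZIP -> Probability_1
map.data <- merge(map.data, data, by = "ZIP")
# 3. attach ZIP and Probability_1 to every vertex row
merged <- merge(map.df, map.data, by = "id")
```

    Note that merge(...) may reorder rows. If polygons come out scrambled on a real map, re-sorting the merged data frame by the order column that fortify(...) creates is a common fix.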

    Drawing the map involves telling ggplot what dataset to use (map.df), which columns to use for x and y (long and lat) and how to group the data by polygon (group=group). The columns long, lat, and group in map.df are all created by the call to fortify(...). The call to geom_polygon(...) tells ggplot to draw polygons and fill using the information in map.df$Probability_1. The call to geom_path(...) tells ggplot to create a layer with state boundaries. The call to scale_fill_gradientn(...) tells ggplot to use a color scheme based on the color brewer "Reds" palette. Finally, the call to coord_equal(...) tells ggplot to use the same scale for x and y so the map is not distorted.
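
    The plotting calls can also be tested in isolation. The sketch below draws two hypothetical unit squares from a hand-built data frame (toy.df, with invented coordinates and probabilities), assuming ggplot2 is installed. The three hex colours are hand-picked values from the "Reds" palette so that RColorBrewer is not needed here.

```r
library(ggplot2)

# Hypothetical stand-in for the merged map.df: four vertices per polygon
toy.df <- data.frame(
  long  = c(0, 0, 1, 1,  2, 2, 3, 3),
  lat   = c(0, 1, 1, 0,  0, 1, 1, 0),
  group = rep(c("0.1", "1.1"), each = 4),
  Probability_1 = rep(c(0.2, 0.7), each = 4)
)

# Same layer structure as the full map, minus the state boundary layer
p <- ggplot(toy.df, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = Probability_1)) +
  scale_fill_gradientn(name = "Probability",
                       colours = c("#FFF5F0", "#FB6A4A", "#67000D")) +
  coord_equal()
```

    Calling print(p) renders the plot. If this toy version draws correctly but the real map does not, the problem is more likely in the merge step than in the ggplot calls.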

    NB: The state boundary layer uses the US States TIGER file.