Tags: r, dictionary, gis, shapefile, census

How to combine state-level shapefiles from the United States Census Bureau into a nationwide shapefile


The Census Bureau doesn't provide a nationwide shapefile of public use microdata areas (PUMAs), the smallest geography available in the American Community Survey. I tried combining the state-level files with a few different methods, but even the one that de-dupes identifiers breaks once it hits California. Am I doing something silly, or does this require a difficult workaround? Here's code to reproduce the problem up to the point where things break.

library(taRifx.geo)
library(maptools)

td <- tempdir() ; tf <- tempfile()

# Windows-only: use the internet2 download method
setInternet2( TRUE )

# grab the FTP directory listing of the 2014 PUMA shapefiles
download.file( "ftp://ftp2.census.gov/geo/tiger/TIGER2014/PUMA/" , tf )

# pull the two-digit state FIPS codes out of the listing
al <- readLines( tf )
tl <- al[ grep( "geo/tiger/TIGER2014/PUMA/tl_2014_" , al ) ]
fp <- gsub( "(.*)geo/tiger/TIGER2014/PUMA/tl_2014_([0-9]*)_puma10\\.zip(.*)" , "\\2" , tl )

# get rid of Alaska
fp <- fp[ fp != '02' ]

af <- paste0( "ftp://ftp2.census.gov/geo/tiger/TIGER2014/PUMA/tl_2014_" , fp , "_puma10.zip" )

# loop through the states: download, unzip, read, and append each shapefile
d <- NULL
for ( i in af ){
    # remove the previous state's extracted files, if any
    try( file.remove( z ) , silent = TRUE )
    download.file( i , tf , mode = 'wb' )
    z <- unzip( tf , exdir = td )
    b <- readShapePoly( z[ grep( 'shp$' , z ) ] )
    # append, asking taRifx.geo to de-duplicate polygon IDs along the way
    if ( is.null( d ) ) d <- b else d <- taRifx.geo:::rbind.SpatialPolygonsDataFrame( d , b , fix.duplicated.IDs = TRUE )
}

# Error in `row.names<-.data.frame`(`*tmp*`, value = c("d.0", "d.1", "d.2",  : 
  # duplicate 'row.names' are not allowed
# In addition: Warning message:
# non-unique values when setting 'row.names': ‘d.0’, ‘d.1’, ‘d.10’, ‘d.11’, ‘d.12’, ‘d.13’, ‘d.14’, ‘d.15’, ‘d.16’, ‘d.17’, ‘d.18’, ‘d.19’, ‘d.2’, ‘d.3’, ‘d.4’, ‘d.5’, ‘d.6’, ‘d.7’, ‘d.8’, ‘d.9’ 
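
The collision is easy to see in isolation: every state shapefile read this way numbers its polygons from "0" upward, so any two layers share polygon IDs. Here's a minimal sketch with two toy single-polygon layers (object names are illustrative); renaming the IDs in one layer before binding is what lets the bind succeed.

library(sp)

# two toy layers whose polygons both carry the default ID "0" --
# the same situation as two state files read with readShapePoly()
p1 <- Polygons( list( Polygon( cbind( c( 0 , 1 , 1 , 0 ) , c( 0 , 0 , 1 , 1 ) ) ) ) , ID = "0" )
p2 <- Polygons( list( Polygon( cbind( c( 2 , 3 , 3 , 2 ) , c( 0 , 0 , 1 , 1 ) ) ) ) , ID = "0" )
a <- SpatialPolygonsDataFrame( SpatialPolygons( list( p1 ) ) , data.frame( v = 1 , row.names = "0" ) )
b <- SpatialPolygonsDataFrame( SpatialPolygons( list( p2 ) ) , data.frame( v = 2 , row.names = "0" ) )

# rbind( a , b )             # fails -- both layers use the polygon ID "0"
b <- spChFIDs( b , "b_0" )   # rename the IDs in one layer...
combined <- rbind( a , b )   # ...and the bind goes through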

Solution

  • Here's another approach, which includes a shortcut for obtaining the FTP directory listing. As @Pop mentioned, the key is to ensure that the IDs are all unique; a quick read-back check is sketched below the plot.

    library(RCurl) 
    library(rgdal)
    
    # get the directory listing
    u <- 'ftp://ftp2.census.gov/geo/tiger/TIGER2014/PUMA/'
    f <- paste0(u, strsplit(getURL(u, ftp.use.epsv = FALSE, ftplistonly = TRUE), 
                            '\\s+')[[1]])
    
    # download and extract to tempdir/shps
    invisible(sapply(f, function(x) {
      path <- file.path(tempdir(), basename(x))
      download.file(x, destfile=path, mode = 'wb')
      unzip(path, exdir=file.path(tempdir(), 'shps'))
    }))
    
    # read in all shps, and prepend shapefile name to IDs
    shps <- lapply(sub('\\.zip', '', basename(f)), function(x) {
      shp <- readOGR(file.path(tempdir(), 'shps'), x)
      shp <- spChFIDs(shp, paste0(x, '_', sapply(slot(shp, "polygons"), slot, "ID")))
      shp
    })
    
    # rbind to a single object
    shp <- do.call(rbind, as.list(shps))
    
    # plot (note: clipping to contiguous states for display purposes)
    plot(shp, axes=T, xlim=c(-130, -60), ylim=c(20, 50), las=1)
    
    # write out to wd/USA.shp
    writeOGR(shp, '.', 'USA', 'ESRI Shapefile')
    

    (Figure: the unified nationwide PUMA shapefile.)
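
As a quick read-back check, you could open the written file and pull out a single state. A sketch, assuming the TIGER attribute columns carried through the bind (STATEFP10 is the state FIPS field in the 2014 PUMA files):

    library(rgdal)
    
    # read the combined layer back in from the working directory
    usa <- readOGR('.', 'USA')
    
    # subset one state by its FIPS code, e.g. California ('06'), and plot it
    ca <- usa[usa$STATEFP10 == '06', ]
    plot(ca, axes = TRUE, las = 1)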