County area calculated from NLCD (Landcover data) rasters is too large

I'm trying to calculate landcover repartition for each US county. I have obtained NLCD for the Apache county using the FedData package (devtools version) and I'm using county shapefiles from the Census bureau.

The problem is that I get an area that is much larger than the official one and the one indicated in my shapefile, namely 51,000km^2 instead of 29,0000km^2 officially. There must be something I don't understand about the raster object but I'm a very confused after hours of websearching, any help appreciated.

The following describes the code used and the method used to calculate. The county data can be downloaded here: https://www2.census.gov/geo/tiger/TIGER2016/COUNTY/

The following code assumes the county shapefile is saved and unzipped.

Get and read the data

#devtools::install_github("ropensci/FedData")
library(FedData)
library(rgdal)
library(dplyr)

#Get Apache polygone
counties<- readOGR('tl_2016_us_county/tl_2016_us_county.shp')
apache <- subset(counties,counties$GEOID=="04001")

# Get NCLD data 
nlcd_data <- get_nlcd(apache,
                      year = 2011,
                      label = "Apache",
                      force.redo = TRUE)

nlcd_data #inspect the object, can see that number of cells is around 57 million

I have then extracted the values of the raster and put them into a frequency table. From there I calculate the resulting area. Since the NLCD data is 30m resolution, I multiply the number of cell of each category by 900 and divide the result by 1 million to obtain the area in Square Kilometer.

The calculated total area is too large.

# Calculating the landcover repartition in County
landcover<-data.frame(x2011 = values(nlcd_data)) #Number of rows corresponds to number of cells 
landcover_freq<-table(landcover)

df_landcover <- as.data.frame(landcover_freq)

res <- df_landcover %>%
  mutate(area_type_sqm = Freq*900,
         area_type_km=area_type_sqm/1e6,
         area_sqkm = sum(area_type_km))%>%
  group_by(landcover)%>%
  mutate(pc_land =round(100*area_type_km/area_sqkm,1))

head(arrange(res,desc(pc_land)))

# A tibble: 6 x 6
# Groups:   landcover [6]
  landcover     Freq area_type_sqm area_type_km area_sqkm pc_land
  <fct>        <int>         <dbl>        <dbl>     <dbl>   <dbl>
1 52        33455938   30110344200      30110.     51107.    58.9
2 42        16073820   14466438000      14466.     51107.    28.3
3 71         5999412    5399470800       5399.     51107.    10.6
4 31          488652     439786800        440.     51107.     0.9
5 21          362722     326449800        326.     51107.     0.6
6 22           95545      85990500         86.0    51107.     0.2

## Total area calculated from raster is 51,107 square km

apache_area <- as.data.frame(apache) %>%
  mutate(AREA=(as.numeric(ALAND)+as.numeric(AWATER))/1e6) %>%
  select(AREA)

apache_area$AREA
29055.47 #Official area of apache county (in square km)

Visual inspection of the shapefile and the raster:

The difference doesn't seem large enough to justify the difference

apache <- spTransform(apache,proj4string(nlcd_data))
plot(nlcd_data)
plot(apache,add=TRUE)

Solution

The reason is that you get the data returned in the Mercator projection.

crs(nlcd_data)
CRS arguments:
 +proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext
+no_defs

This coordinate reference system preserves shape and that is why it is used for web-mapping. It should not be used for most other purposes, as it is also notorious for distortion of area.

The take-home message is to never just trust the nominal resolution of a projected raster and assume that it is correct and/or constant. The reliable way to compute area is to use longitude/latitude coordinates, because these are, by definition, not distorted.

The reported spatial resolution is

res(nlcd_data)
[1] 30 30

So it is not surprising that you expected that the cells have an area of 30 x 30 = 900 m². But the cells sizes are actually between 573 and 625 m² for Apache county. Illustrated below.

First get the data

library(FedData)
counties <- raster::getData("GADM", country="USA", level=2)
apache <- subset(counties,counties$NAME_2=="Apache")
nlcd <- get_nlcd(apache, year = 2011, label = "nlcd_apache", force.redo = TRUE)

# move to terra
library(terra)
r <- rast(nlcd)
ap <- vect(apache)
# county boundaries to Mercator
apm <- project(ap, crs(r))

To compute the area of the grid cells I represent them as polygons. I first aggregate to get much larger cells to avoid getting too many small polygons (it would take very long, and perhaps crash R). I then transform the Mercator polygons to longitude/latitude, compute their true area, and transform back (just for consistent display purposes).

f <- 300
a <- aggregate(rast(r), f)
p <- as.polygons(a)

# compute area
g <- project(p, "+proj=longlat")
g$area <- round(expanse(g) / (f * f), 5)

# project back and plot
merc <- project(g, crs(r))
plot(merc, "area", border=NA)
lines(apm)

The map shows the approximate variation in the size of the original "900 m²" cells (between 573 and 625). This is not the case when I use the original data, as illustrated below.

library(terra)
# "filename" is the local file that has the nlcd data
usnlcd <- rast(filename)
crs(usnlcd, proj4=TRUE)
#[1] "+proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs"

res(x)
#[1] 30 30

Note that +proj=aea stands for the Albers Equal Area projection!

ap <- vect(apache)
apm <- project(ap, crs(usnlcd))
x <- crop(usnlcd, apm)

par(mfrow=c(1,2))
plot(x)
lines(apm)
# gg2 computed as above
plot(gg2, "area", border=NA)

As you can see, the cell area is indeed 900 m², with only very little distortion, so small that it can be ignored.

You could transform the Mercator data back to the original +proj=aea, but then you would have degraded the quality of the data twice. So that is a really bad idea. You could also account for the true cell area of each cell, but that is rather convoluted.

Finally, to get the area covered by each land cover class

m <- mask(x, apm)
f <- freq(m)
f <- data.frame(f)
f$area <- round(f$count * 900 / 1e6, 1)

# the next step is a bit tricky and will be done by `freq` in future versions
levs <- levels(m)[[1]]
f$class <- levs[f$value + 1]

Voila:

f[, c("class", "area")]
#                           class    area
#1                     Open Water    21.5
#2          Developed, Open Space   175.3
#3       Developed, Low Intensity    46.4
#4    Developed, Medium Intensity     9.7
#5      Developed, High Intensity     1.0
#6                    Barren Land   232.5
#7               Deciduous Forest    44.6
#8               Evergreen Forest  7604.6
#9                   Mixed Forest     2.0
#10                   Shrub/Scrub 18262.8
#11                   Herbaceuous  2514.4
#12                   Hay/Pasture     1.9
#13              Cultivated Crops     0.0
#14                Woody Wetlands    38.8
#15 Emergent Herbaceuous Wetlands    53.6

And the total is as expected

sum(f$area)
#[1] 29009.1

PS. This problem has now been solved at the source --- but I hope this answer will remain useful for others using spatial data with a Mercator CRS.