Search code examples
rgeospatialr-sfterra

Using terra and sf in R: Why am I getting illogical distance measurements?


I am using terra to get "curvy" distances between points within a bounding polygon and comparing those to straight-line distances that ignore the polygon. The results I'm getting back don't make sense, and I am hoping you all could help me figure out what is going on.

We load the US Congressional map used in the 114th Congress for the state of Texas first:

texas = dget("https://raw.githubusercontent.com/BajczA475/random-data/main/texascongressmaps")
ggplot() + geom_sf(data = texas$geometry)

We also make some storage objects:

longest.dist.district.straight = rep(NA, 36)
longest.dist.district.curved = rep(NA, 36)

Then, we go district by district (n = 36). For each, we take a sample of 100 random points within that district's polygon. Then, we ask "What is the longest straight-line distance between any two of our 100 points?" We then rasterize the polygon, mask it, and go point by point, asking "How far is this point from all others, assuming we cannot travel outside the polygon?" This means we'll have to bend around within the polygon to get between the points some of the time. We find the longest such distance between any two points. We then compare the straight-line and curvy-line approaches, with the assumption that the curvy-line approaches will always be longer by some amount...

for(c in 1:36) { #Texas had 36 districts.
if(c %% 3 == 0) {print(c)} # Progress bar

this.district = texas[c, ] #Get the current district

#We'll get a sample of 100 randomly placed points around the district.
rand.ptsDistrict = sf::st_sample(this.district,
size = 100,
type = 'random',
exact = TRUE)

#What's the max straight-line distance between any two points?
longest.dist.district.straight[c] = max(sf::st_distance(rand.ptsDistrict))

#Now, calculate our 'as the politician would walk' distances (aka curvy distances). We need to do this for each of our 100 points separately, with each as the target point in turn, and save the longest value we get...
current.raster = terra::ext(this.district) # Rasterizing
current.raster = terra::rast(current.raster,
nrow=100, ncol=100,
crs = crs(this.district),
vals = 1)
current.raster = terra::mask(current.raster, # Masking
terra::vect(this.district),
updatevalue = NA)
point.locs = terra::cellFromXY(current.raster, # Getting point locations in the new grid
sf::st_coordinates(rand.ptsDistrict))

longest.dists.i = rep(NA, 100) # Storage object
for(i in 1:100) {
point.i.loc = cellFromXY(current.raster, #Focal point this time.
st_coordinates(rand.ptsDistrict[i]))
point.noni.loc = cellFromXY(current.raster, #All other points
st_coordinates(rand.ptsDistrict[-i]))
terra::values(current.raster)[point.i.loc] = 2 # Make focal point the target value
all.dists = terra::gridDistance(current.raster, #Get all distances to the target value
target = 2, scale = 1)
longest.dists.i[i] = max(values(all.dists)[point.noni.loc], na.rm=TRUE) # Find the longest of these for this point and store it.
terra::values(current.raster)[point.i.loc] = 1
}
longest.dist.district.curved[c] = max(longest.dists.i) # Find the longest curved distance between any two points in the current district.
}

When I do this, I always get straight-line distances that are strictly longer than the curvy distances from the same district, which doesn't logically make sense--how could a straight line between two points ever be longer than a curvy line between them?

> (cbind(longest.dist.district.straight, longest.dist.district.curved))
      longest.dist.district.straight longest.dist.district.curved
 [1,]                      239285.77                    121703.64
 [2,]                       63249.88                     48238.89
 [3,]                       49495.09                     24823.91
 [4,]                      290542.38                    147894.80
 [5,]                      213758.13                    108663.63
 [6,]                      129261.83                     68351.77
 [7,]                       36705.18                     22081.22
 [8,]                      165759.58                     87749.33
 [9,]                       38317.61                     19903.54
[10,]                      196211.38                    100959.66
[11,]                      505130.81                    261479.58
[12,]                       79502.87                     45134.11
[13,]                      604901.43                    313317.24
[14,]                      201724.57                    115286.81
[15,]                      414257.14                    208204.75
[16,]                       61867.34                     32115.77
[17,]                      193198.96                    103829.75
[18,]                       41693.26                     26462.02
[19,]                      433902.07                    225041.00
[20,]                       32201.45                     17060.41
[21,]                      212300.45                    119597.54
[22,]                       88143.49                     46720.59
[23,]                      777236.95                    394663.54
[24,]                       39692.06                     21192.98
[25,]                      299336.81                    153871.46
[26,]                       65901.64                     35200.83
[27,]                      272822.43                    158724.70
[28,]                      362477.84                    205297.74
[29,]                       40210.19                     30094.43
[30,]                       44693.37                     23430.33
[31,]                       93781.16                     50340.85
[32,]                       38941.81                     21047.40
[33,]                       52395.85                     31169.46
[34,]                      394586.71                    206545.50
[35,]                      138182.61                     73556.10
[36,]                      223351.15                    112601.38

I can only guess I have either messed up the code somewhere or else have found a bug. Please help! Thanks!

Edit: I just noticed after posting this that it looks like if I were to multiply the curvy distances by 2, I'd get values that were believable (the curvy distances are always longer but by a variable amount)--but I don't see a coding reason to need to do this...can anyone else see one I'm missing?


Solution

  • You are comparing the shortest-distance ("as the crow flies" to those who have not seen crows fly) with the grid-distance (move from the center of a cell to the center of a neighboring cell), only allowing to use the grid cells that fall within a district.

    When I run a condensed version of your code, I see that the distances are very similar, with the grid distance always longer, as they should be, except for district 14 as that district is not contiguous.

    library(terra)
    #terra 1.6.47
    
    texas <- dget("https://raw.githubusercontent.com/BajczA475/random-data/main/texascongressmaps")
    tex <- vect(texas)
    
    # generate random points
    set.seed(0)
    b <- spatSample(tex[, "DISTRICT"], size = 100, method="random", strata=1:nrow(tex))
    
    # max distance between any two random points by district.
    pdist <- sapply(tex$DISTRICT, \(i) max( distance( b[b$DISTRICT == i, ])) )
    
    # max grid distance between any two random points by district.
    pgrid <- rep(NA, nrow(tex))
    for (i in 1:nrow(tex)) {
        r <- rast(tex[i,], nrow=100, ncol=100)
        r <- rasterize(tex[i,], r)
        xy <- crds(b[b$DISTRICT==i, ])
        cells <- cellFromXY(r, xy)
        maxdists <- rep(NA, 100)
        for(j in 1:100) {
            r[cells[j]] <- 2
            dists <- gridDist(r, target=2)
            # Find the longest of these for this point
            maxdists[j] <- max( dists[ cells[-j] ], na.rm=TRUE)
            r[cells[j]] <- 1
        }
        pgrid[i] <- max(maxdists) 
    }
    

    The results look good:

    head(cbind(pdist, pgrid))
    #      pdist     pgrid
    #1 217746.46 223906.22
    #2  61707.87  99422.07
    #3  50520.61  51479.98
    #4 282744.13 293656.59
    #5 196074.08 202014.45
    #6 120913.60 126532.72
    
    plot(pdist, pgrid)
    abline(0, 1, col="red")
    

    enter image description here

    If your results are different you are perhaps using an older version of "terra"? I assume you are because you are using gridDistance which works with a warning because it was renamed to gridDist in the current version.

    You use different grid cell sizes for each district. I do not know what your goal is, but it might be more reasonable to use a single template raster for all of Texas. You could do something like

    # outside the loop
    rr <- rast(tex, res=1/60, vals=1)
    # inside the loop
    r <- crop(rr, tex[i,], mask=TRUE)