Tags: r, raster, tiff, spdf

How can I obtain cell values and coordinate data from a (.tif) raster when R's rasterToPoints function is not working?


I am interested in extracting the cell values alongside their corresponding x and y coordinates from a tif file accessible from the WorldPop database [ https://hub.worldpop.org/geodata/summary?id=49920 ].

I have converted this file, alongside other tif files available on this website, into rasters and then used the rasterToPoints function in R to extract this information. However, although this approach has worked for most of the files, it has failed for this particular file amongst a few others: the R session just hangs and the call never returns when I attempt to convert these rasters to spdf (SpatialPointsDataFrame) objects.

library(raster)
Raster <- raster("C:/file path/aus_ppp_2020_UNadj_constrained.tif")
Raster <- rasterToPoints(Raster, spatial = TRUE)

As an alternative, I thought I could extract the coordinates after obtaining the cell values using getValues() or readAll(), but because the raster is so large I run into the following error:

Error: cannot allocate vector of size 17.8 Gb.
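
For reference, the step that triggers this is roughly the following (a sketch of the approach rather than my exact script):

library(raster)
Raster <- raster("C:/file path/aus_ppp_2020_UNadj_constrained.tif")

# reading all cell values into memory at once is the step that fails
vals <- getValues(Raster)
# Error: cannot allocate vector of size 17.8 Gb

# the coordinates would then come from the cell numbers, e.g.
# xy <- xyFromCell(Raster, which(!is.na(vals)))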

sessionInfo()
# R version 4.2.0 (2022-04-22 ucrt)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 22000)

library(memuse)
memuse::Sys.meminfo()
# Totalram:  31.781 GiB 
# Freeram:   26.164 GiB 

I then tried to see if I could increase the usable memory using memory.limit(), but this function was retired in R 4.2 and I cannot find an alternative.

memory.limit()
# Warning: 'memory.limit()' is no longer supported
# [1] Inf

I was wondering if anyone knows:

1. If there is a way I could get the rasterToPoints function to work for this raster.

2. If there is a way to subset the raster to smaller rasters, while retaining all data, so that I could use the rasterToPoints function on each subset and then merge the resulting spatial point dataframes.

3. If there is an alternative way to extract the coordinates alongside the cell values for this tif file.

Any help is greatly appreciated.


Solution

  • Although it generally seems questionable to break such an efficient storage format apart - it's not like the relation between values and coordinates is lost in a raster - there is a pretty convenient way to do so using terra.

    However, this case seems to be an exception to the general rule since your raster is huge - 55,075 rows by 74,928 columns - with only 2,910,517 values that are not NA according to global(r, "notNA"). In other words, roughly 99.9 % of the values in your raster are NA, so I see why you want to convert it.

    Unfortunately, my first attempt had some flaws: the as.polygons(r) |> as.points() |> as.data.frame(geom = "XY") pipe only returned a result because of the dissolve = TRUE default reducing the number of objects, and the output consisted of the nodes of the dissolved polygons rather than of cell centers. So I'd like to correct my answer by making use of makeTiles(), which seems to be an appropriate approach when dealing with huge raster data and facing std::bad_alloc errors.

    Read your raster r and create a second raster t to define the tiling, using the extent and crs of r but with a much coarser resolution (here, a cell size 4000 times larger):

    library(terra)
    #> terra 1.6.17
    
    r <- rast("aus_ppp_2020_UNadj_constrained.tif")
    r
    #> class       : SpatRaster 
    #> dimensions  : 55075, 74928, 1  (nrow, ncol, nlyr)
    #> resolution  : 0.0008333333, 0.0008333333  (x, y)
    #> extent      : 96.81625, 159.2562, -55.11708, -9.22125  (xmin, xmax, ymin, ymax)
    #> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
    #> source      : aus_ppp_2020_UNadj_constrained.tif 
    #> name        : aus_ppp_2020_UNadj_constrained 
    #> min value   :                          0.000 
    #> max value   :                       1441.356
    
    t <- rast(ext = ext(r), crs = crs(r), resolution = res(r) * 4000)
    t
    #> class       : SpatRaster 
    #> dimensions  : 14, 19, 1  (nrow, ncol, nlyr)
    #> resolution  : 3.333333, 3.333333  (x, y)
    #> extent      : 96.81625, 160.1496, -55.11708, -8.450416  (xmin, xmax, ymin, ymax)
    #> coord. ref. : lon/lat WGS 84 (EPSG:4326)
    
    makeTiles(r, t)
    

    As a result, you'll find 14 x 19 tiles named tile_*.tif in your working directory - this process takes some minutes. Maybe there is a more elegant way to solve this without writing the subsets of your raster to disk, but on the other hand you want to prevent a RAM overflow, so writing to disk (temporarily) seems quite appropriate. There is also no strict necessity to split your raster into 266 tiles - one could additionally work out the critical number of cells per tile to speed this up a little; I simply chose the factor by trial and error (see the sketch below). However, if time is not critical, this workflow should at least produce proper results.
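
    For what it's worth, here is a rough sketch of how one could derive the tiling factor from an assumed per-tile cell limit instead of hard-coding 4000 - target_cells is an arbitrary value of mine, not something prescribed by terra:

    # hypothetical: pick the factor so that each tile of t covers at most
    # `target_cells` cells of r (each cell of t spans fact x fact cells of r)
    target_cells <- 2.5e7               # assumed comfortable per-tile limit
    fact <- floor(sqrt(target_cells))   # here: 5000
    t <- rast(ext = ext(r), crs = crs(r), resolution = res(r) * fact)

    Either way, the per-tile processing below stays the same.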

    # list filenames of your tiles
    tiles <- list.files(pattern = "tile_[0-9]+.tif", full.names = TRUE)
    
    n <- tiles |> length()
    
    # init an empty data frame
    result <- data.frame()
    
    # iterate over all files; 
    # read tile, write values (if present) as data frame, gather results
    for (i in 1:n) {
      
      rt <- rast(tiles[i])
      
      if (global(rt, "notNA") > 0) {
        
        x <- rt |> as.points() |> as.data.frame(geom = "XY")
        
        result <- rbind(result, x)
      }
    }
    
    # inspect result
    head(result)
    #> aus_ppp_2020_UNadj_constrained        x         y
    #> 1                      0.2376212 130.0342 -11.75917
    #> 2                      0.2404007 130.0350 -11.75917
    #> 3                      0.2336724 130.0358 -11.75917
    #> 4                      1.3149673 113.2033 -26.00583
    #> 5                      1.2930205 113.2033 -26.00667
    #> 6                      1.2527549 113.1792 -26.11167
    
    dim(result)
    #> [1] 2910517       3
    
    # save data frame encompassing all values as csv file
    write.csv(result, "aus_ppp_2020_UNadj_constrained.csv", row.names = FALSE)
    

    Since result consists of 2,910,517 rows - exactly the number of non-NA cells reported by global(r, "notNA") - I'd consider this approach robust.
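
    If you want to double-check this on your own machine, an optional sanity check - note that global() scans the full raster again, so it takes a moment:

    # optional: the number of extracted points should equal the number of
    # non-NA cells of the original raster
    stopifnot(nrow(result) == unlist(global(r, "notNA")))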