
R - Efficiently create dataframe from large raster excluding NA values


Apologies for cross-posting something similar on the GIS Stack Exchange.

I am looking for a more efficient way to create a frequency table based on a large raster in R.

Currently, I have a few dozen rasters, ~150 million cells in each, and I need to create a frequency table for each. These rasters are derived from masking a base raster with a few hundred small sampling locations*. Therefore, the rasters I am creating the tables from contain ~99% NA values.

My current working approach is this:

    sampling_site_raster <- raster("FILE")
    base_raster <- raster("FILE")

    sample_raster <- mask(base_raster, sampling_site_raster)

    DF <- as.data.frame(freq(sample_raster, useNA='no', progress='text'))

    ### run time for the freq() process ###
    user  system elapsed 
    162.60    4.85  168.40

This uses the freq() function from the raster package in R. The useNA='no' flag drops the NA values from the table.

My questions are:

1) Is there a more efficient way to create a frequency table from a large raster that is ~99% NA values?

2) Is there a more efficient way to derive the values from the base raster than by using mask()? (Using the Mask GP function in ArcGIS is very fast, but the result still contains the NA values and it is an extra step.)

*Additional info: the sample areas represented by sampling_site_raster are irregular shapes of various sizes spread randomly across the study area. In sampling_site_raster, the sampling sites are encoded as 1 and non-sampling areas as NA.

Thank you!


Solution

  • If you mask the raster with another raster, you will always get another huge raster, so I don't think that is the way to make things faster.

    What I would do instead is to mask by a polygon layer using extract():

    res <- extract(raster, polygons)
    

    Then you will have all the cell values for each polygon and can run freq() (or table()) on them, as in the sketch below.
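
    A minimal sketch of that workflow, assuming the sampling sites are also available as a polygon layer (the file names and the sampling_sites object are placeholders, not from the original question):

        library(raster)

        base_raster <- raster("FILE")

        # assumption: the polygon layer the sampling sites were rasterized from
        sampling_sites <- shapefile("SAMPLING_SITES_FILE")

        # extract() returns a list with one vector of cell values per polygon,
        # so only the ~1% of cells inside the sites are read
        vals <- unlist(extract(base_raster, sampling_sites))

        # pool the values, drop NAs, and tabulate
        DF <- as.data.frame(table(vals[!is.na(vals)]))

    If the sites only exist as a raster, rasterToPolygons(sampling_site_raster, dissolve=TRUE) can produce such a polygon layer, although that conversion itself can be slow for very large rasters.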