Tags: r, gis, raster, r-raster

Dealing with NAs when extracting data from a raster stack at point coordinates


I'm working with data points for the presence of different species. In R, I create random points of presence and absence. I stack a bunch of different rasters such as elevation, orientation, geological data, and climatic variables from BioClim. My problem comes when I extract the raster values at my presence/absence points: some points land on invalid cells of some rasters and get NAs. I would like to know how to replace these NAs with the value of the nearest valid cell.

Thanks in advance.

I tried using for and if loops but I was unsuccessful. I also asked ChatGPT, but it only suggested interpolating the data with na.approx and the like, or adding parameters to the extract function from the raster package such as method="bilinear", and other things that did not help.

New code (terra)

My current workflow (25/09):

I load my shapefiles with terra::vect and my rasters with terra::rast. I stack the rasters with c() and then:

#Create a buffer so that pseudo-absence points are not placed too close to presence points
buf <- buffer(Species_data, 1000) #1 km radius
#Erase the buffers from the landmass polygon
Background_area <- erase(Land_polygon, aggregate(buf))

#Generate the pseudo-absence points
Random_points <- terra::spatSample(Background_area, size = 2.5*nrow(Species_data), method = "random")

#Create a Presence/absence variable and merge both vectors of points
Random_points$Presence <- 0
Species_data$Presence <- 1

spg_all <- union(Random_points, Species_data)

#Extract the raster stack values at each point
extracted_predictors_raw <- terra::extract(raster_stack, spg_all, xy=T)

I still want to find the nearest valid value (instead of an NA) for the points that fall on NA cells of some rasters because of their resolution (especially BioClim). I have read some papers that use kNN for that, but any solution will work for me.
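Something like the sketch below is what I have in mind (my own rough attempt, using the raster_stack and spg_all objects from the workflow above): for each layer, points that extracted NA would take the value of the nearest non-NA cell. I am not sure it is efficient, since the distance matrix can get large when a layer has many valid cells.

library(terra)

#Extract as before; some rows will contain NAs
ext <- terra::extract(raster_stack, spg_all, xy=T)

#For each layer, replace NA extractions with the value of the nearest non-NA cell
for (lyr in names(raster_stack)) {
  na_rows <- which(is.na(ext[[lyr]]))
  if (length(na_rows) == 0) next
  #Centres of the non-NA cells of this layer, with their values
  valid_cells <- as.points(raster_stack[[lyr]], na.rm=TRUE)
  #Distances between each NA point and every valid cell centre
  d <- terra::distance(spg_all[na_rows, ], valid_cells)
  nearest_cell <- apply(d, 1, which.min)
  ext[na_rows, lyr] <- terra::values(valid_cells)[nearest_cell, 1]
}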

Old code (raster)

Workflow (23/09):

I load my rasters and shapefile with the functions "raster" and "readOGR" and define my extent. Then I check that all the rasters and points are aligned, and I stack all my rasters with the "stack" function. I generate my absence points using "spsample" and create my points file as a SpatialPointsDataFrame. Then:

extracted_predictors_raw = data.frame(raster::extract(stack, points)) %>%
  left_join(Legend_geo, by = c("Geology" = "ORDRE")) %>%
  select(-Geologia, -LEGEND)

spg_data=cbind(points@data,points@coords,extracted_predictors_raw)

coordinates(spg_data)=c("lon","lat")

NA_SPG=na.omit(spg_data@data)

mtry <- tuneRF(NA_SPG[, names(NA_SPG) != "Presence"], as.factor(NA_SPG$Presence), ntreeTry=500,
               stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE)
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]

rf1 <- randomForest(as.factor(Presence) ~.,
                    data=NA_SPG, 
                    mtry=best.m[1],
                    na.action=na.omit,
                    replace = TRUE,
                    ntree=10000)

So, instead of removing the NAs, I would like to take the value from the valid cell closest to the point. I have read some papers where they deal with this using kNN, but I'm new to working with spatial data in R.


Solution

  • There are different solutions here:

    1. limit the extraction to non-NA values in the raster stack (which seems to me the best option)
    2. remove the points where NAs occur from the database
    3. based on the raster resolution and the occurrence of NAs, extract raster values in a buffer around each point and take the mean (or median, or whatever you need); this does not guarantee the absence of NAs (a sketch of this option is given after the P.S. below)

    The best choice really depends on your needs. If you need a precise number of presence/absence points, solutions 1) and 2) should be adapted to reach the number of points you need. Here is a simple example of the first solution:

    library(terra)
    set.seed(4)
    f <- system.file("ex/elev.tif", package="terra")
    r <- rast(f)
    #create raster stack
    rr <- c(r,r*2,r/2,log(r))
    ncell <- ncol(r)*nrow(r)
    #put some random NAs 
    for (i in 1:nlyr(rr)) {
      rr[[i]][sample(1:ncell,ncell/2,replace=F)] <- NA
    }
    plot(rr)
    #sample excluding NA values
    pnt <- spatSample(rr,500,"random",as.points=T,na.rm=T)
    summary(pnt)
    

    If you want a fixed number of points, you can keep sampling until you reach the number you want:

    #count the cells with finite values in all four layers (the maximum number of sampleable points)
    max_point <- sum(apply(is.finite(values(rr)),1,sum)==4)
    #sample 50% of the total area
    npoint <- max_point/2
    pnt<- spatSample(rr,npoint,"random",as.df=T,na.rm=T,xy=T)
    
    while (nrow(pnt)!=npoint) {
      x <- spatSample(rr,npoint,"random",as.df=T,na.rm=T,xy=T)
      pnt <- rbind(pnt,x)
      pnt <- dplyr::distinct(pnt)
      if (nrow(pnt)>npoint) {
        pnt <- pnt[1:npoint,]
        break
      }
    }
    

    P.S. Since spatSample has the argument values=T, you do not need to extract the raster values after sampling (i.e., there is no need for extract(raster, points, ...)).
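
    For completeness, a minimal sketch of option 3 (buffer extraction). It assumes a SpatVector of points called pnt_vec; the 1000 m buffer width is only an example and should be chosen based on the raster resolution:

    #buffer each point and summarise the non-NA raster values inside each buffer
    buf <- buffer(pnt_vec, width=1000)
    buffered_vals <- terra::extract(rr, buf, fun=mean, na.rm=TRUE)

    If a buffer contains only NA cells the summarised value is still missing, so check the output.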

    EDIT

    The problem with KNN imputation is that the nearest neighbour is searched for in the covariate space, not in geographical space, and if you have pixels that are NA for every predictor, you cannot use KNN for those pixels. Another approach could be bagged tree imputation (see https://topepo.github.io/caret/pre-processing.html#imputation). After extracting random points with spatSample you can do something like this:

    library(caret)
    
    pnt <- spatSample(rr,500,"random",as.df=T,xy=T)
    pp <- preProcess(pnt[,-c(1:2)],method="bagImpute")
    new_pnt <- predict(pp,pnt[,-c(1:2)])
    new_pnt <- cbind(pnt[,1:2],new_pnt)
    

    However, I personally think that it is better to sample real values than to impute them. The while loop seems a reasonable compromise.