Search code examples
rrandom-forestrasterpredict

Generating prediction raster from Random Forest model using R?


I fit a Random Forest model to tabular data from test sites in R, and now would like to generate a raster showing predicted probability values using raster data corresponding to the same predictors (e.g., slope, elevation, pH) that are in the model.

The RF model is built to predict a 0/1 binary variable SITE_NONSITE using different environmental and geophysical data.

#random forest model
set.seed(321)
rf1 <- randomForest(formula=SITE_NONSITE ~., data=dcc.s.dummy, ntree=500, mtry=10)

dcc.s.dummy includes the following data:

str(dcc.s.dummy)
'data.frame':   7899 obs. of  25 variables:
 $ COST_DIST_ECOTONE        : num  -0.232 0.176 -0.443 -0.478 -0.305 ...
 $ COST_DIST_HEA            : num  -0.233 -0.659 -1.055 -0.999 -0.455 ...
 $ COST_DIST_MEDSTR         : num  0.74388 0.63933 0.55964 0.50768 0.00993 ...
 $ COST_DIST_RIV_COAST      : num  0.59 0.63 0.621 0.639 0.617 ...
 $ DEM30_ASP_RE_2           : num  0 0 0 0 1 0 0 0 0 0 ...
 $ DEM30_ASP_RE_3           : num  0 1 0 0 0 0 0 0 1 0 ...
 $ DEM30_ASP_RE_4           : num  1 0 0 0 0 0 0 1 0 0 ...
 $ DEM30_ASP_RE_5           : num  0 0 1 1 0 1 1 0 0 1 ...
 $ DEM30_M                  : num  0.916 0.72 0.499 0.54 1.114 ...
 $ DEM30_SLOPE              : num  0.2063 0.4631 -0.6445 -0.0512 -0.8235 ...
 $ LOC_REL_RE               : num  -0.489 -0.476 -0.476 -0.459 -0.661 ...
 $ LOC_SD_SLOPE             : num  -0.118 -0.135 -0.316 -0.367 -0.57 ...
 $ SSURGO_ESRI_DRAINAGE_RE_2: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_3: num  1 1 1 1 1 1 1 1 1 1 ...
 $ SSURGO_ESRI_DRAINAGE_RE_4: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_5: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_DRAINAGE_RE_6: num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_EROSION_RE_2 : num  0 0 0 0 0 1 1 0 0 1 ...
 $ SSURGO_ESRI_EROSION_RE_3 : num  1 1 1 0 1 0 0 1 1 0 ...
 $ SSURGO_ESRI_EROSION_RE_4 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ SSURGO_ESRI_LOC_DIV      : num  -0.328 -0.188 -0.157 -0.213 -0.652 ...
 $ SSURGO_ESRI_NATIVEVEG_2  : num  1 1 1 0 1 0 0 1 1 1 ...
 $ SSURGO_ESRI_NATIVEVEG_3  : num  0 0 0 0 0 1 1 0 0 0 ...
 $ SSURGO_PH                : num  0.813 0.059 1.529 2.32 -1.298 ...
 $ SITE_NONSITE             : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2

I then take rasters corresponding to these same predictors across my entire study area and combine them into a raster stack:

#plot model predictions
COST_DIST_ECOTONE <- raster("cost_dist_ecotone_s.tif.tif")
COST_DIST_HEA <- raster("cost_dist_hea_s.tif.tif")
COST_DIST_MEDSTR <- raster("cost_dist_medstr_s.tif.tif")
COST_DIST_RIV_COAST <- raster("cost_dist_riv_coast_s.tif.tif")
DEM30_ASP_RE_2 <- raster("dem30_asp_rel_2.tif.tif")
DEM30_ASP_RE_3 <- raster("dem30_asp_rel_3.tif.tif")
DEM30_ASP_RE_4 <- raster("dem30_asp_rel_4.tif.tif")
DEM30_ASP_RE_5 <- raster("dem30_asp_rel_5.tif.tif")
DEM30_M <- raster("dem30_m_s.tif.tif")
DEM30_SLOPE <- raster("dem30_slope_s.tif.tif")
LOC_REL_RE <- raster("loc_rel_re_s.tif.tif")
LOC_SD_SLOPE <- raster("loc_sd_slope_s.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_2 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_2.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_3 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_3.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_4 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_4.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_5 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_5.tif.tif")
SSURGO_ESRI_DRAINAGE_RE_6 <- raster("SSURGO_ESRI_drainage_reclass_nulfill_6.tif.tif")
SSURGO_ESRI_EROSION_RE_2 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_2.tif.tif")
SSURGO_ESRI_EROSION_RE_3 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_3.tif.tif")
SSURGO_ESRI_EROSION_RE_4 <- raster("SSURGO_ESRI_erosion_reclass_nulfilll_4.tif.tif")
SSURGO_ESRI_LOC_DIV <- raster("SSURGO_ESRI_loc_div_s.tif.tif")
SSURGO_ESRI_NATIVEVEG_2 <- raster("SSURGO_ESRI_nativeveg_nullfill_2.tif.tif")
SSURGO_ESRI_NATIVEVEG_3 <- raster("SSURGO_ESRI_nativeveg_nullfill_3.tif.tif")
SSURGO_PH <- raster("SSURGO_pH_nullfill_s.tif.tif")

ApPl_stack <- stack(COST_DIST_ECOTONE, COST_DIST_HEA, COST_DIST_MEDSTR, COST_DIST_RIV_COAST, DEM30_ASP_RE_2, DEM30_ASP_RE_3, DEM30_ASP_RE_4, DEM30_ASP_RE_5, DEM30_M, DEM30_SLOPE, LOC_REL_RE, LOC_SD_SLOPE, SSURGO_ESRI_DRAINAGE_RE_2, SSURGO_ESRI_DRAINAGE_RE_3, SSURGO_ESRI_DRAINAGE_RE_4, SSURGO_ESRI_DRAINAGE_RE_5, SSURGO_ESRI_DRAINAGE_RE_6, SSURGO_ESRI_EROSION_RE_2, SSURGO_ESRI_EROSION_RE_3, SSURGO_ESRI_EROSION_RE_4, SSURGO_ESRI_LOC_DIV, SSURGO_ESRI_NATIVEVEG_2, SSURGO_ESRI_NATIVEVEG_3, SSURGO_PH)

However, trying to use this raster stack ApPl_stack in raster::predict() fails with the following error:

ApPl_prob <- raster::predict(rf1, newdata=ApPl_stack, type="prob")

Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘structure("RasterLayer", package = "raster")’ to a data.frame

Converting to a data frame and using that instead generates this error instead:

ApPl_df <- as.data.frame(ApPl_stack, xy=TRUE)
ApPl_prob <- raster::predict(rf1, newdata=ApPl_df, type="prob")

Error in model.frame.default(Terms, newdata, na.action = na.omit) :
object is not a matrix In addition: Warning message: 'newdata' had 658242 rows but variables found have 754 rows

It can't be a coincidence that there are 658242 cells and 754 rows in each of my predictor rasters. What am I missing here? I feel like one of the functions is expecting a data type it's not getting.


Solution

  • The "object names" have nothing to do with the layer names, so you need to set these to match the names in the data.frame used to fit the model. In most workflows you would do something like

    f <- c("cost_dist_ecotone_s.tif.tif", "cost_dist_hea_s.tif.tif", "cost_dist_medstr_s.tif.tif")
    s <- stack(f)
    names(s) <- gsub(".tif.tif", "", f)
    

    And then extract values from the RasterStack to fit your model --- in that case the names already match.

    But the main mistake you made was here

    ApPl_prob <- raster::predict(rf1, newdata=ApPl_stack, type="prob")
    

    The first argument should be the RasterStack:

    ApPl_prob <- raster::predict(ApPl_stack, rf1, type="prob")
    

    Or use named parameters as you did in your answer

    raster::predict(model=rf1, object=ApPl_stack, type="prob")