I am working on an image classification script in R using the rgdal package. The raster in question is a PCIDSK file with 28 channels: a training data channel, a validation data channel, and 26 spectral data channels. The objective is to populate a data frame containing the values of each pixel which is not zero in the training data channel, plus the associated spectral values in the 26 bands.
In Python/Numpy, I can easily import all the bands for the entire image into a multi-dimensional array, however, due to memory limitations the only option in R seems to be importing this data block by block, which is very slow:
library(rgdal)
raster = "image.pix"
training_band = 2
validation_band = 1
BlockWidth = 500
BlockHeight = 500
# Get some metadata about the whole raster
myinfo = GDALinfo(raster)
ysize = myinfo[[1]]
xsize = myinfo[[2]]
numbands = myinfo[[3]]
# Iterate through the image in blocks and retrieve the training data
column = 0
training_data = NULL
while(column < xsize){
if(column + BlockWidth > xsize){
BlockWidth = xsize - column
}
row = 0
while(row < ysize){
if(row + BlockHeight > ysize){
BlockHeight = ysize - row
}
# Do stuff here
myblock = readGDAL(raster, region.dim = c(BlockHeight,BlockWidth), offset = c(row, column), band = c(training_band,3:numbands), silent = TRUE)
blockdata = matrix(NA, dim(myblock)[1], dim(myblock)[2])
for(i in 1:(dim(myblock)[2])){
bandname = paste("myblock", names(myblock)[i], sep="$")
blockdata[,i]= as.matrix(eval(parse(text=bandname)))
}
blockdata = as.data.frame(blockdata)
blockdata = subset(blockdata, blockdata[,1] > 0)
if (dim(blockdata)[1] > 0){
training_data = rbind(training_data, blockdata)
}
row = row + BlockHeight
}
column = column + BlockWidth
}
remove(blockdata, myblock, BlockHeight, BlockWidth, row, column)
Is there a faster/better way of doing the same thing without running out of memory?
The next step after this training data is collected is to create the classifier (randomForest package) which also requires a lot of memory, depending on the number of trees requested. This brings me to my second problem, which is that creating a forest of 500 trees is not possible given the amount of memory already occupied by the training data:
myformula = formula(paste("as.factor(V1) ~ V3:V", dim(training_data)[2], sep=""))
r_tree = randomForest(formula = myformula, data = training_data, ntree = 500, keep.forest=TRUE)
Is there a way to allocate more memory? Am I missing something? Thanks...
[EDIT] As suggested by Jan, using the "raster" package is much faster; however as far as I can tell, it does not solve the memory problem as far as gathering the training data is concerned because it eventually needs to be in a dataframe, in memory:
library(raster)
library(randomForest)
# Set some user variables
fn = "image.pix"
training_band = 2
validation_band = 1
# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_class[training_class == 0] = NA
training_class = Which(training_class != 0, cells=TRUE)
training_data = extract(myraster, training_class)
training_data = as.data.frame(training_data)
So while this is much faster (and takes less code), it still does not solve the issue of not having enough free memory to create the classifier... Is there some "raster" package function that I have not found that can accomplish this? Thanks...
Check out the Raster package. The Raster package provides a handy wrapper for Rgdal without loading it into memory.
http://raster.r-forge.r-project.org/
Hopefully this help.
The 'raster' package deals with basic spatial raster (grid) data access and manipulation. It defines raster classes; can deal with very large files (stored on disk); and includes standard raster functions such as overlay, aggregation, and merge.
The purpose of the 'raster' package is to provide easy to use functions for raster manipulation and analysis. These include high level functions such as overlay, merge, aggregate, projection, resample, distance, polygon to raster conversion. All these functions work for very large raster datasets that cannot be loaded into memory. In addition, the package provides lower level functions such as row by row reading and writing (to many formats via rgdal) for building other functions.
By using the Raster package you can avoid filling your memory before using randomForest.
[EDIT] To solve the memory problem with randomForest maybe it helps if you could learn the individual trees within the random forest on subsamples (of size << n) rather than bootstrap samples (of size n).