I am working on performing image processing using Numpy, specifically a running standard deviation stretch. This reads in X number of columns, finds the Std. and performs a percentage linear stretch. It then iterates to the next "group" of columns and performs the same operations. The input image is a 1GB, 32-bit, single band raster which is taking quite a long time to process (hours). Below is the code.
I realize that I have 3 nested for loops which is, presumably where the bottleneck is occurring. If I process the image in "boxes", that is to say loading an array that is [500,500] and iterating through the image processing time is quite short. Unfortunately, camera error requires that I iterate in extremely long strips (52,000 x 4) (y,x) to avoid banding.
Any suggestions on speeding this up would be appreciated:
def box(dataset, outdataset, sampleSize, n):
quiet = 0
sample = sampleSize
#iterate over all of the bands
for j in xrange(1, dataset.RasterCount + 1): #1 based counter
band = dataset.GetRasterBand(j)
NDV = band.GetNoDataValue()
print "Processing band: " + str(j)
#define the interval at which blocks are created
intervalY = int(band.YSize/1)
intervalX = int(band.XSize/2000) #to be changed to sampleSize when working
#iterate through the rows
scanBlockCounter = 0
for i in xrange(0,band.YSize,intervalY):
#If the next i is going to fail due to the edge of the image/array
if i + (intervalY*2) < band.YSize:
numberRows = intervalY
else:
numberRows = band.YSize - i
for h in xrange(0,band.XSize, intervalX):
if h + (intervalX*2) < band.XSize:
numberColumns = intervalX
else:
numberColumns = band.XSize - h
scanBlock = band.ReadAsArray(h,i,numberColumns, numberRows).astype(numpy.float)
standardDeviation = numpy.std(scanBlock)
mean = numpy.mean(scanBlock)
newMin = mean - (standardDeviation * n)
newMax = mean + (standardDeviation * n)
outputBlock = ((scanBlock - newMin)/(newMax-newMin))*255
outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i)#array, xOffset, yOffset
scanBlockCounter = scanBlockCounter + 1
#print str(scanBlockCounter) + ": " + str(scanBlock.shape) + str(h)+ ", " + str(intervalX)
if numberColumns == band.XSize - h:
break
#update progress line
if not quiet:
gdal.TermProgress_nocb( (float(h+1) / band.YSize) )
Here is an update: Without using the profile module, as I did not want to start wrapping small sections of the code into functions I used a mix of print and exit statements to get a really rough idea about which lines were taking the most time. Luckily (and I do understand how lucky I was) one line was dragging everything down.
outRaster = outdataset.GetRasterBand(j).WriteArray(outputBlock,h,i)#array, xOffset, yOffset
It appears that GDAL is quite inefficient when opening the output file and writing out the array. With this in mind I decided to add my modified arrays "outBlock" to a python list, then write out chunks. Here is the segment that I changed:
The outputBlock was just modified ...
#Add the array to a list (tuple)
outputArrayList.append(outputBlock)
#Check the interval counter and if it is "time" write out the array
if len(outputArrayList) >= (intervalX * writeSize) or finisher == 1:
#Convert the tuple to a numpy array. Here we horizontally stack the tuple of arrays.
stacked = numpy.hstack(outputArrayList)
#Write out the array
outRaster = outdataset.GetRasterBand(j).WriteArray(stacked,xOffset,i)#array, xOffset, yOffset
xOffset = xOffset + (intervalX*(intervalX * writeSize))
#Cleanup to conserve memory
outputArrayList = list()
stacked = None
finisher=0
Finisher is simply a flag that handles the edges. It took a bit of time to figure out how to build an array from the list. In that, using numpy.array was creating a 3-d array (anyone care to explain why?) and write array requires a 2d array. Total processing time is now varying from just under 2 minutes to 5 minutes. Any idea why the range of times might exist?
Many thanks to everyone who posted! The next step is to really get into Numpy and learn about vectorization for additional optimization.
Answered as requested.
If you are IO bound, you should chunk your reads/writes. Try dumping ~500 MB of data to an ndarray, process it all, write it out and then grab the next ~500 MB. Make sure to reuse the ndarray.