Search code examples
pythonnumpymemoryrastercdf

Python-Read very large raster and plot empirical cumulative distribution function, memory error


I'm trying to plot an empirical cumulative distribution function (CDF) of data from a 380Gb binary raster. Using just a small mask of the data, the following code works perfectly.

import numpy as np
import matplotlib.pyplot as plt
dem_name = open('./raster.dem','rb')
vals = np.fromfile(dem_name,dtype='float32')
vals = np.negative(vals[vals!=-9999])
vals = np.sort(vals)
y = np.arange(1.,len(vals)+1.)/len(vals)
plt.plot(vals,y)

However, when I try to load the whole raster using this code, it obviously gives a memory error. My computer has 9Tb of disk space but is limited to 16Gb of RAM, so I have used numpy.memmap to get the raster values into an array.

dem_name = open('./raster.dem','rb')
vals = np.memmap(dem_name,dtype='float32','r')

This works, but I need to trim the nodata values (-9999) from the raster, switch the sign of the values (negative values becomes positive) and sort the values from lowest to highest.

vals_real = np.memmap(np.sort(np.negative(vals[vals!=-9999])))

This runs for a few hours and then gives a memory error.

The y array,

y = np.arange(1.,len(vals)+1.)/len(vals)

is also too big to be stored in RAM (gives a memory error), but I can't figure out how to store the array as a memmap object.

Is it correct that in order to plotting also takes memory, such that I will need enough disk space for 2X the size of the raster file ( 2x 380Gb)?

So to summarize, I need to read the huge raster into python and plot a CDF. It's very simple with a small raster, but I've been unsuccessful making this plot with the full raster.

I hope this question is clear. Thanks in advance.


Solution

  • With 380Gb of single precision floats, you have about 95 billion values. Don't attempt to plot the ECDF using all 95 billion values. Most plotting software can't handle that many points, and even if it could, most displays are only a few thousand pixels wide, so there is no point in plotting data with resolution much higher than that.

    Instead, compute a histogram, and work in batches. If you already know reasonable lower and upper bounds for the values in the file, you can preallocate the histogram bins. Otherwise, you might need a histogram algorithm that can adapt to the new data that arrives in each batch.