Python particles simulator: out-of-core processing

Problem description

I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in python/numpy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.

The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.

The simulation has 3 steps (the code below is for an all-in-RAM version):

1. Simulate (and store) an emission rate array (contains many almost-0 elements):
   - shape (n_particles x time_size), float32, size 80GB
2. Compute the counts array (random values from a Poisson process with the previously computed rates):
   - shape (n_particles x time_size), uint8, size 20GB
```
counts = np.random.poisson(lam=emission).astype(np.uint8)
```
3. Find the timestamps (or indices) of the counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.
```
# Loop across the particles; np.nonzero returns a tuple, so take its first element
timestamps = [np.nonzero(c)[0] for c in counts]
```

I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.

Question

I have a working in-memory implementation and I'm trying to understand what the best approach is to implement an out-of-core version that can scale to (much) longer simulations.

What I would like to exist

I need to save arrays to a file, and I would like to use a single file for a simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).

Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.
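To make the wish concrete, here is a minimal sketch of the kind of interface I have in mind. It uses np.memmap as a stand-in for the file-backed array; `chunked_max` is a hypothetical helper written for this illustration, not an existing numpy function, and the sizes are tiny placeholders. Note that a raw memmap file has no compression and no place for the parameters, so it does not satisfy the single-file requirement above.

```python
import os
import tempfile
import numpy as np

def chunked_max(arr, chunksize):
    """Hypothetical helper: per-particle max, reading one chunk at a time."""
    result = np.full(arr.shape[0], -np.inf)
    for start in range(0, arr.shape[1], chunksize):
        chunk = arr[:, start:start + chunksize]  # only this slice is touched
        result = np.maximum(result, chunk.max(axis=1))
    return result

# Placeholder sizes, far smaller than the real simulation
n_particles, time_size, chunksize = 3, 10_000, 1024
path = os.path.join(tempfile.mkdtemp(), "emission.dat")

# Preallocate the file-backed array, then fill it chunk by chunk
emission = np.memmap(path, dtype=np.float32, mode="w+",
                     shape=(n_particles, time_size))
rng = np.random.default_rng(0)
for start in range(0, time_size, chunksize):
    stop = min(start + chunksize, time_size)
    emission[:, start:stop] = rng.random((n_particles, stop - start),
                                         dtype=np.float32)
emission.flush()

max_rates = chunked_max(emission, chunksize)
```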

Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.

What are the practical options

As a first option I started experimenting with PyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover, my current solution (see below) is UGLY and not very efficient.

So the options for which I seek an answer are:

- implement a numpy array with the required functionality (how?)
- use PyTables in a smarter way (different data-structures/methods)
- use another library: h5py, blaze, pandas... (I haven't tried any of them so far).

Tentative solution (pyTables)

I save the simulation parameters in a '/parameters' group: each parameter is converted to a numpy array scalar. It's a verbose solution, but it works.

I save emission as an extensible array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size though). Saving counts is more problematic. If I save it as a PyTables array, it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and I query them with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array significantly clutters the HDF5 file. Moreover, strangely enough, creating those tables requires creating a custom dtype (even for standard numpy dtypes):

    # One single-column table per particle; each row holds one uint8 count
    dt = np.dtype([('counts', 'u1')])
    for ip in range(n_particles):
        name = "particle_%d" % ip
        data_file.create_table(
            group, name, description=dt, chunkshape=chunksize,
            expectedrows=time_size,
            title='Binned timetrace of emitted ph (bin = t_step)'
                  ' - particle_%d' % ip)

Each particle-counts "table" has a different name (name = "particle_%d" % ip), so I need to put them all in a python list for easy iteration.

EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.

Solution

PyTables Solution

Since the functionality provided by Pandas is not needed, and its processing is much slower (see notebook below), the best approach is to use PyTables or h5py directly. I've tried only the PyTables approach so far.

All tests were performed in this notebook:

Python particles simulator: numpy out-of-core processing

Introduction to pytables data-structures

Reference: Official PyTables Docs

PyTables allows storing data in HDF5 files in 2 types of formats: arrays and tables.

Arrays

There are 3 types of arrays: Array, CArray and EArray. They all allow storing and retrieving (multidimensional) slices with a notation similar to numpy slicing.

    # Write data to store (broadcasting works)
    array1[:] = 3

    # Read data from store
    in_ram_array = array1[:]

For optimization in some use cases, CArray is saved in "chunks", whose size can be chosen with the chunkshape argument at creation time.

Array and CArray size is fixed at creation time. You can fill/write the array chunk-by-chunk after creation though. Conversely EArray can be extended with the .append() method.
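As a rough illustration of the difference, the sketch below fills a fixed-size CArray chunk by chunk and grows an EArray with .append(). The file name, node names and the tiny sizes are placeholders chosen for this example, not values from the simulation.

```python
import os
import tempfile
import numpy as np
import tables

# Placeholder sizes for the example only
n_particles, time_size, chunk = 2, 1000, 256
path = os.path.join(tempfile.mkdtemp(), "arrays_demo.h5")

with tables.open_file(path, mode="w") as h5:
    # CArray: shape fixed at creation, stored in chunks, filled afterwards
    carr = h5.create_carray("/", "fixed", atom=tables.UInt8Atom(),
                            shape=(n_particles, time_size),
                            chunkshape=(n_particles, chunk))
    for start in range(0, time_size, chunk):
        stop = min(start + chunk, time_size)
        carr[:, start:stop] = 1  # scalar is broadcast over the slice

    # EArray: the dimension declared with length 0 grows on each .append()
    earr = h5.create_earray("/", "growing", atom=tables.Float32Atom(),
                            shape=(n_particles, 0))
    for start in range(0, time_size, chunk):
        stop = min(start + chunk, time_size)
        earr.append(np.zeros((n_particles, stop - start), dtype=np.float32))

    fixed_shape = tuple(int(s) for s in carr.shape)
    growing_shape = tuple(int(s) for s in earr.shape)
    fixed_sum = int(carr[:].sum())
```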

Tables

The table is quite a different beast. It's basically a "table": you have only a 1D index and each element is a row. Each row holds the "columns", and each column can have a different data type. If you are familiar with numpy record-arrays, a table is basically a 1D record-array, with each element having as many fields as there are columns.

1D or 2D numpy arrays can be stored in tables but it's a bit more tricky: we need to create a row data type. For example, to store a 1D uint8 numpy array we need to do:

    table_uint8 = np.dtype([('field1', 'u1')])
    table_1d = data_file.create_table('/', 'array_1d', description=table_uint8)

So why use tables? Because, unlike arrays, tables can be efficiently queried. For example, if we want to search for elements > 3 in a huge disk-based table we can do:

    index = table_1d.get_where_list('field1 > 3')

Not only is it simple (compared with arrays, where we need to scan the whole file in chunks and build the index in a loop), but it is also extremely fast.
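Putting the two snippets above together, a self-contained sketch could look like this. The sample values and the temporary file path are made up for the example; the field and node names follow the snippets above.

```python
import os
import tempfile
import numpy as np
import tables

row_dtype = np.dtype([("field1", "u1")])
values = np.array([0, 5, 1, 4, 3, 7], dtype=np.uint8)  # made-up sample data

path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")
with tables.open_file(path, mode="w") as h5:
    table_1d = h5.create_table("/", "array_1d", description=row_dtype)

    # Wrap the plain array in a structured array matching the row type
    rows = np.zeros(values.size, dtype=row_dtype)
    rows["field1"] = values
    table_1d.append(rows)
    table_1d.flush()

    # Row indices where the condition holds, evaluated by PyTables on disk
    index = table_1d.get_where_list("field1 > 3")
```

For this input the query should select the rows holding 5, 4 and 7.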

How to store simulation parameters

The best way to store simulation parameters is to use a group (e.g. /parameters), convert each scalar to a numpy array and store it as a CArray.
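A minimal sketch of the round trip is shown below. As a simplifying assumption it stores each scalar with create_array (a plain Array) rather than a CArray, and the parameter names and values are invented for the example.

```python
import os
import tempfile
import numpy as np
import tables

# Invented example parameters (scalars only)
params = {"n_particles": 20, "time_size": 1000, "t_step": 5e-7}
path = os.path.join(tempfile.mkdtemp(), "params_demo.h5")

# Save: one node per parameter inside the /parameters group
with tables.open_file(path, mode="w") as h5:
    group = h5.create_group("/", "parameters")
    for name, value in params.items():
        h5.create_array(group, name, obj=np.array(value))

# Load: rebuild the dictionary from the nodes of the group
with tables.open_file(path, mode="r") as h5:
    loaded = {node.name: np.asarray(node.read()).item()
              for node in h5.list_nodes("/parameters")}
```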

Array for "`emission`"

emission is the biggest array, and it is generated and read sequentially. For this usage pattern a good data structure is EArray. On "simulated" data with ~50% zero elements, blosc compression (level=5) achieves a 2.2x compression ratio. I found that a chunk size of 2^18 (256k) gives the minimum processing time.
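The following sketch shows how such an EArray might be created, filled and read back sequentially. Applying the 2^18 value to the time axis of chunkshape is my assumption here, and the random data (with about half the elements zeroed) only mimics the real emission rates.

```python
import os
import tempfile
import numpy as np
import tables

n_particles = 2
chunksize = 2**18
time_size = 2 * chunksize + 1000  # placeholder, tiny next to ~10^9
path = os.path.join(tempfile.mkdtemp(), "emission_demo.h5")
filters = tables.Filters(complevel=5, complib="blosc")
rng = np.random.default_rng(1)

total_written = 0.0
with tables.open_file(path, mode="w") as h5:
    emission = h5.create_earray(
        "/", "emission", atom=tables.Float32Atom(),
        shape=(n_particles, 0),               # time axis is extendable
        chunkshape=(n_particles, chunksize),  # assumed chunk layout
        filters=filters, expectedrows=time_size)
    for start in range(0, time_size, chunksize):
        n = min(chunksize, time_size - start)
        chunk = rng.random((n_particles, n), dtype=np.float32)
        chunk[chunk < 0.5] = 0                # roughly 50% zeros
        emission.append(chunk)
        total_written += float(chunk.sum(dtype=np.float64))

# Sequential read: only one chunk is in RAM at a time
total_read = 0.0
with tables.open_file(path, mode="r") as h5:
    emission = h5.root.emission
    stored_shape = tuple(int(s) for s in emission.shape)
    for start in range(0, stored_shape[1], chunksize):
        stop = min(start + chunksize, stored_shape[1])
        total_read += float(emission[:, start:stop].sum(dtype=np.float64))
```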

Storing "`counts`"

Also storing "counts" will increase the file size by 10% and will take 40% more time to compute timestamps. Having counts stored is not an advantage per se because only the timestamps are needed in the end.

The advantage is that reconstructing the index (timestamps) is simpler because we query the full time axis in a single command (.get_where_list('counts >= 1')). Conversely, with chunked processing, we need to perform some index arithmetic that is a bit tricky, and maybe a burden to maintain.

However, the code complexity may be small compared to all the other operations (sorting and merging) that are needed in both cases.
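For reference, the index arithmetic needed by the chunked approach amounts to adding each chunk's start offset to the local indices. Below is a sketch with plain numpy; `timestamps_chunked` is a name invented for this example, and the in-RAM `counts` array stands in for any array-like that supports slicing (for example a disk-backed one).

```python
import numpy as np

def timestamps_chunked(counts, chunksize):
    """Find nonzero positions per particle, scanning the time axis in chunks."""
    n_particles, time_size = counts.shape
    parts = [[] for _ in range(n_particles)]
    for start in range(0, time_size, chunksize):
        chunk = counts[:, start:start + chunksize]
        for ip, c in enumerate(chunk):
            # Indices are local to the chunk: shift them by the chunk offset
            parts[ip].append(np.nonzero(c)[0] + start)
    return [np.concatenate(p) for p in parts]

# Small made-up data set to compare against the all-in-RAM result
rng = np.random.default_rng(0)
emission = rng.random((3, 10_000)) * 0.01
counts = rng.poisson(lam=emission).astype(np.uint8)

chunked = timestamps_chunked(counts, chunksize=1024)
reference = [np.nonzero(c)[0] for c in counts]
```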

Storing "`timestamps`"

Timestamps can be accumulated in RAM. However, we don't know the array sizes before starting, and a final hstack() call is needed to "merge" the different chunks stored in a list. This doubles the memory requirements, so the RAM may be insufficient.

We can store the timestamps to a table as we go using .append(). At the end we can load the table in memory with .read(). This is only 10% slower than the all-in-memory computation but avoids the "double-RAM" requirement. Moreover, we can avoid the final full load and have minimal RAM usage.
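A sketch of the append-as-you-go variant for a single particle follows. The table layout (one int64 column named "timestamp"), the file name and the simulated counts are assumptions made for the example.

```python
import os
import tempfile
import numpy as np
import tables

# Made-up counts for one particle; mostly zeros
rng = np.random.default_rng(3)
time_size, chunksize = 20_000, 4096
counts = rng.poisson(lam=0.01, size=time_size).astype(np.uint8)

ts_dtype = np.dtype([("timestamp", "i8")])
path = os.path.join(tempfile.mkdtemp(), "timestamps_demo.h5")

with tables.open_file(path, mode="w") as h5:
    ts_table = h5.create_table("/", "timestamps", description=ts_dtype)
    for start in range(0, time_size, chunksize):
        idx = np.nonzero(counts[start:start + chunksize])[0] + start
        if idx.size:  # skip chunks without any count
            rows = np.zeros(idx.size, dtype=ts_dtype)
            rows["timestamp"] = idx
            ts_table.append(rows)  # goes to disk, no growing list in RAM
    ts_table.flush()

    # Optional final load of the whole column into memory
    timestamps = ts_table.read()["timestamp"]
```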

H5Py

H5py is a much simpler library than PyTables. For this use case of (mainly) sequential processing it seems a better fit than PyTables. The only missing feature is 'blosc' compression. Whether this results in a big performance penalty remains to be tested.
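I have not benchmarked it, but the equivalent h5py code would look roughly like the sketch below. It uses the built-in gzip filter in place of blosc; the dataset name, compression level and sizes are placeholders.

```python
import os
import tempfile
import numpy as np
import h5py

# Placeholder sizes for the example only
n_particles, time_size, chunksize = 2, 5000, 1024
path = os.path.join(tempfile.mkdtemp(), "emission_h5py.h5")
rng = np.random.default_rng(2)

expected_max = np.zeros(n_particles, dtype=np.float32)
with h5py.File(path, "w") as f:
    # Preallocated, chunked and compressed dataset
    emission = f.create_dataset(
        "emission", shape=(n_particles, time_size), dtype="f4",
        chunks=(n_particles, chunksize),
        compression="gzip", compression_opts=4)
    for start in range(0, time_size, chunksize):
        stop = min(start + chunksize, time_size)
        chunk = rng.random((n_particles, stop - start), dtype=np.float32)
        emission[:, start:stop] = chunk  # numpy-style slice assignment
        expected_max = np.maximum(expected_max, chunk.max(axis=1))

# Sequential chunked read with numpy-style slicing
running_max = np.zeros(n_particles, dtype=np.float32)
with h5py.File(path, "r") as f:
    emission = f["emission"]
    stored_shape = emission.shape
    for start in range(0, time_size, chunksize):
        stop = min(start + chunksize, time_size)
        running_max = np.maximum(running_max, emission[:, start:stop].max(axis=1))
```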