
numpy convention for storing time series of vectors and matrices - items in rows or columns?


I'm working with discrete-time simulations of ODEs with time varying parameters. I have time series of various data (e.g. time series of state vectors generated by solve_ivp, time series of system matrices generated by my control algorithm, time series of system matrices in modal form, and so on).

My question: in what order should I place the indices? My intuition is that since numpy arrays are (by default) stored in row-major order, and I want per-item locality, each row should contain the "item" (i.e. a vector or matrix), and so the number of rows is the number of time points, and the number of columns is the dimension of my vector, e.g.:

x_k = np.zeros((5000, 4)) # a time series of 5000 4-vectors
display(x_k[25])          # the 26th time point

Or for matrices I might use:

A_k = np.zeros((5000, 4, 4)) # a time series of 5000 4x4 matrices

However, solve_ivp appears to do the opposite and returns a row-major array with the time series in columns (sol.y has shape (4, 5000)). Furthermore, transposing the result with .T just flips a flag to column-major, so it is not really clear what the developers of solve_ivp and numpy intend me to do to write cache-efficient code.
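For instance, a quick check confirms that .T only swaps the strides, so an explicit copy is needed to get a genuinely row-major, time-major array (the array here is a stand-in for sol.y):

```python
import numpy as np

# Hypothetical stand-in for sol.y from solve_ivp: 4 states over 5000 steps
y = np.zeros((4, 5000))
print(y.flags['C_CONTIGUOUS'])       # True: row-major as returned
print(y.T.flags['C_CONTIGUOUS'])     # False: .T only swaps the strides
y_rows = np.ascontiguousarray(y.T)   # explicit copy into row-major order
print(y_rows.flags['C_CONTIGUOUS'])  # True: shape (5000, 4), per-item locality
```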

What are the conventions? Should I use the first index for the time index, as in my examples above, or last index as solve_ivp does?


Solution

  • This depends strongly on the algorithms applied to your dataset. The problem is generally known as AoS versus SoA (array of structures versus structure of arrays). For algorithms that do not benefit much from SIMD operations and that access all fields of each item, AoS can be better; otherwise SoA is usually better. The optimal data structure is often AoSoA, but it is often a pain to manipulate (so it is rare in Numpy code). On top of that, Numpy is not efficient at operating on arrays with a very small last axis because of the way it is currently implemented (more specifically, because of internal generators, unneeded function calls, and the lack of function specialization, which is hard to do due to the high number of possible combinations).


    Example

    Here is a first practical example showing this (center of a point cloud):

    aos = np.random.rand(1024*1024, 2)
    %timeit -n 10 aos.mean(axis=0)          # 17.6 ms ± 503 µs per loop
    
    soa = np.random.rand(2, 1024*1024)
    %timeit -n 10 soa.mean(axis=1)          #  1.7 ms ±  77 µs per loop
    

    Here we can see that the SoA version is much faster (about 10 times). This is because the SoA version benefits from SIMD instructions while the AoS version does not and suffers from the overhead of Numpy's internal iterator. Technically, note that the AoS version could be implemented to be nearly as fast as the SoA version here, but Numpy is not yet able to optimize this (nor similar cases, which are actually not so easy to optimize).


    In your case

    For matrices, Numpy can call BLAS functions on contiguous arrays, which is good for performance. However, a 4x4 matrix-vector operation takes a tiny amount of time: only a few nanoseconds on mainstream CPUs (for AoS). Indeed, multiplying the rows of a 4x4 matrix by a vector takes only 4 AVX instructions, which can typically be computed in only 2 cycles. Then comes the sum reduction, which also takes a few nanoseconds (~4 cycles per row for a naive hadd reduction, i.e. 16 cycles for the whole matrix). Meanwhile, a BLAS function call from Numpy and the management of internal iterators take significantly more than 10 ns per matrix. This means most of the time will be spent in Numpy overheads with an AoS layout. Thus, a np.zeros((5000, 4, 4)) array will certainly not be so bad, but it is clearly far from optimal. You can strongly reduce these overheads by writing your own specialized implementation (with Cython/Numba) designed specifically for 4x4 matrices.
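As an illustration, here is a minimal sketch of such a specialized AoS kernel (the function name and shapes are my own; with Numba you would decorate it with @numba.njit so the loops compile to native code, but it is shown undecorated here so it also runs without Numba installed):

```python
import numpy as np

# Sketch of a specialized AoS kernel for batched 4x4 matrix-vector products.
# With Numba, adding @numba.njit removes the Python interpreter overhead;
# the explicit loops then compile to tight SIMD-friendly native code.
def matvec_4x4_aos(A_k, x_k, out):
    # A_k: (n, 4, 4), x_k: (n, 4), out: (n, 4) -- AoS layout, time index first
    n = A_k.shape[0]
    for t in range(n):
        for i in range(4):
            s = 0.0
            for j in range(4):
                s += A_k[t, i, j] * x_k[t, j]
            out[t, i] = s
    return out

A_k = np.random.rand(5000, 4, 4)
x_k = np.random.rand(5000, 4)
out = np.empty((5000, 4))
matvec_4x4_aos(A_k, x_k, out)
```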

    With a SoA data layout (i.e. (4, 4, 5000)), you can write your own vectorized operations (e.g. a SoA-based matrix multiplication). A naive implementation will certainly not be very efficient either, because creating/filling temporary Numpy arrays is expensive. However, temporary arrays can often be preallocated/reused, and operations can often be done in-place to reduce the overheads. On top of that, you can tune the size of the temporary arrays so they fit in the L1 cache (though this is tedious since it makes the code more complex, so Numpy users generally do not want to do it). That being said, calling Numpy functions from CPython also has a significant overhead (generally 0.2-3.0 µs on my i5-9600KF CPU). This is a problem since a basic computation on 5000 double-precision floating-point numbers in the L1 cache typically takes less than 1 µs. As a result, there is a good chance that most of the time will be spent in CPython/Numpy overheads with a SoA array of only 5000 items manipulated only with Numpy. Here again, Cython/Numba can be used to nearly remove these overheads. The resulting Cython/Numba code should be faster on SoA than on AoS arrays (mainly because horizontal SIMD operations are generally inefficient and AoS operations tend to be hard to optimize, especially on modern CPUs with wide SIMD instruction sets).
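For example, a SoA matrix-vector product can be written with a fixed number of whole-array operations into a preallocated output (a sketch; the function name is my own and the shapes follow the (4, 4, 5000) layout above):

```python
import numpy as np

def matvec_4x4_soa(A, x, out):
    # A: (4, 4, n), x: (4, n), out: (4, n) -- SoA layout, time index last
    out[:] = 0.0
    for i in range(4):
        for j in range(4):
            out[i] += A[i, j] * x[j]  # each line is vectorized over all n steps
    return out

A = np.random.rand(4, 4, 5000)
x = np.random.rand(4, 5000)
out = np.empty((4, 5000))  # preallocated once, reused across calls
matvec_4x4_soa(A, x, out)
```

Note that only 16 Numpy calls are made regardless of the number of time steps, so the per-call overhead is amortized over the whole time series.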


    Conclusion

    This is a complicated topic. In your specific case, I expect both SoA and AoS to be inefficient if you only use Numpy (though the SoA version might be a bit faster): most of the time will be spent in overheads. As a result, the speed of the best implementation depends on the exact algorithm implementation and even on the CPU used (so the best is to try which one is faster in practice).

    That being said, I think using SoA is significantly better performance-wise than AoS. Indeed, code operating on SoA arrays can be optimized more easily and further than code operating on AoS arrays (see Cython/Numba, or even native C code).

    On top of that, SoA-based code is much more likely to benefit from accelerators like GPUs. Indeed, GPUs are massively-SIMD hardware devices operating on wide SIMD vectors (e.g. 32 items at once). Contiguous 4x4 AoS matrix operations are generally pretty inefficient on them, while SIMD-friendly SoA ones are cheap.

    I advise you to write clean/simple Numpy code first, preferring a SoA layout for your arrays, and then optimize the slow parts later (possibly with Cython/Numba/native code). This strategy often results in relatively clean code that is simple to optimize.