Tags: python, numpy, dictionary, nested, pickle

Python: more efficient data structure than a nested dictionary of dictionaries of arrays?


I'm writing a python-3.10 program that predicts time series of various properties for a large number of objects. My current choice of data structure for collecting results internally in the code and then for writing to files is a nested dictionary of dictionaries of arrays. For example, for two objects with time series of 3 properties:

import numpy as np

properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
              'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

The reason I like this nested dictionary format is that it is intuitive to access -- the outer key is the object name, and the inner keys are the property names. The value under each inner key is a numpy array giving some property as a function of time. My actual code generates a dict of ~100,000 objects (outer keys), each having ~100 properties (inner keys) recorded at ~1,000 times (numpy float arrays).
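
For example, one property time series for one object is just:

properties['obj1']['x']    # the x time series for obj1, a length-10 float array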

I have noticed that when I do np.savez('filename.npz', **properties) on my own huge properties dictionary (or subsets of it), it takes a while and the output files are a few GB (probably because np.savez falls back to pickle under the hood, since the values it is handed are dicts rather than arrays).
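
For the toy example above, this is roughly what happens (just a sketch, not my real pipeline): each dict value gets wrapped in a 0-d object array that is pickled inside the .npz, and the file can only be read back with allow_pickle=True:

np.savez('filename.npz', **properties)    # each value is a dict, so savez stores pickled 0-d object arrays

loaded = np.load('filename.npz', allow_pickle=True)   # raises ValueError without allow_pickle=True
obj1 = loaded['obj1'].item()                          # .item() unwraps the 0-d object array back into a dict
obj1['x']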

Is there a more efficient data structure that is widely applicable to my use case? Is it worth switching from my nested dict to pandas DataFrames, numpy ndarrays or record arrays, or a list of some kind of Table-like objects? It would be nice to be able to save/load the file in a binary format that preserves the mapping from object names to their dict/array/table/dataframe of properties, and of course the names of each property's time series array.


Solution

  • Let's look at your obj2 value, a dict:

    In [307]: dd={'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}
    
    In [308]: dd
    Out[308]: 
    {'time': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
     'x': array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
            -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
            -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519]),
     'vx': array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
             1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
            -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012])}
    

    It's easy to make a DataFrame from that:

    In [309]: df = pd.DataFrame(dd)
    
    In [310]: df
    Out[310]: 
        time         x        vx
    0      0 -0.481979 -1.602281
    1      1  0.155978 -1.491630
    2      2  0.441134 -1.170610
    3      3  1.380628 -0.092675
    4      4 -1.212734 -0.941331
    5      5 -1.271200  1.863910
    6      6  1.530727  1.006901
    7      7  1.979926 -0.161684
    8      8  0.136479  1.518014
    9      9 -1.370568 -1.164364
    10    10 -2.064708 -0.202543
    11    11  0.923150 -1.602801
    12    12  0.308854 -1.917494
    13    13  0.648600  0.253666
    14    14  1.302735 -1.619930
    
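    The same construction scales up to your full properties dict. A minimal sketch (the pd.concat-with-keys layout is just one option, my own suggestion rather than anything you must use):

    import numpy as np
    import pandas as pd

    properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
                  'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

    # one DataFrame per object, stacked with the object name as the outer index level
    frames = {name: pd.DataFrame(props) for name, props in properties.items()}
    big = pd.concat(frames, names=['object', 'row'])

    big.loc['obj2']                      # recovers the per-object frame
    # big.to_pickle('properties.pkl')    # one binary format that keeps all the names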

    We could also make a structured array from that frame. I could also make the array directly from your dict, defining the same compound dtype (a sketch of that route follows the output below). But since I already have the frame, I'll go this route. The distinction between a structured array and a recarray is minor.

    In [312]: arr = df.to_records()
    
    In [313]: arr
    Out[313]: 
    rec.array([( 0,  0, -0.48197915, -1.60228105),
               ( 1,  1,  0.15597792, -1.49163002),
               ( 2,  2,  0.44113401, -1.17061046),
               ( 3,  3,  1.38062753, -0.09267467),
               ( 4,  4, -1.21273378, -0.94133092),
               ( 5,  5, -1.27120008,  1.86391024),
               ( 6,  6,  1.53072667,  1.006901  ),
               ( 7,  7,  1.9799255 , -0.16168439),
               ( 8,  8,  0.13647925,  1.5180135 ),
               ( 9,  9, -1.37056793, -1.16436363),
               (10, 10, -2.06470784, -0.20254291),
               (11, 11,  0.92314969, -1.60280149),
               (12, 12,  0.30885371, -1.91749387),
               (13, 13,  0.64860014,  0.25366602),
               (14, 14,  1.30273519, -1.61993012)],
              dtype=[('index', '<i8'), ('time', '<i4'), ('x', '<f8'), ('vx', '<f8')])
    
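    For reference, here's a sketch of the "directly from your dict" route mentioned above, building the compound dtype from the dd keys (this skips the extra 'index' field that df.to_records() adds):

    import numpy as np

    dd = {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}

    # one field per dict key, dtype taken from each array
    dt = np.dtype([(name, arr.dtype) for name, arr in dd.items()])
    arr2 = np.zeros(15, dtype=dt)
    for name, val in dd.items():
        arr2[name] = val                 # fill each field from the corresponding array

    arr2['x']                            # field access mirrors the original dict access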

    Now let's compare the pickle strings:

    In [314]: import pickle
    
    In [315]: len(pickle.dumps(dd))
    Out[315]: 561
    
    In [316]: len(pickle.dumps(df))      # df.to_pickle makes a 1079 byte file
    Out[316]: 1052
    
    In [317]: len(pickle.dumps(arr))     # arr.nbytes is 420
    Out[317]: 738                        # np.save writes a 612 byte file
    

    And another encoding - a list:

    In [318]: alist = list(dd.items())
    In [319]: alist
    Out[319]: 
    [('time', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
     ('x',
      array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
             -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
             -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519])),
     ('vx',
      array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
              1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
             -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012]))]
    In [320]: len(pickle.dumps(alist))
    Out[320]: 567
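
    Finally, one way to apply all this to your full properties dict, sketched rather than benchmarked: one structured array per object, saved with np.savez_compressed (plain savez works too). Every value handed to savez is then a real ndarray, the object-name/property-name mapping survives in the .npz, and loading needs no pickle:

    import numpy as np
    import pandas as pd

    properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
                  'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

    # one structured array per object (index=False drops the extra index field)
    recs = {name: pd.DataFrame(props).to_records(index=False) for name, props in properties.items()}
    np.savez_compressed('properties.npz', **recs)

    loaded = np.load('properties.npz')   # plain numeric fields, so no allow_pickle needed
    loaded['obj2']['x']                  # object name, then property name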