Tags: python, numpy, dictionary, nested, pickle

Python: more efficient data structure than a nested dictionary of dictionaries of arrays?


I'm writing a python-3.10 program that predicts time series of various properties for a large number of objects. My current choice of data structure for collecting results internally in the code and then for writing to files is a nested dictionary of dictionaries of arrays. For example, for two objects with time series of 3 properties:

import numpy as np

properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
              'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

The reason I like this nested dictionary format is that it is intuitive to access -- the outer key is the object name, and the inner keys are the property names. The value under each inner key is a numpy array giving some property as a function of time. My actual code generates a dict of ~100,000 objects (outer keys), each having ~100 properties (inner keys) recorded at ~1,000 times (numpy float arrays).
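
For example, one property time series for one object is just:

properties['obj1']['x']    # the x time series for obj1, a length-10 float array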

I have noticed that when I do np.savez('filename.npz', **properties) on my own huge properties dictionary (or subsets of it), it takes a while and the output files are a few GB (probably because np.savez falls back to pickle under the hood, since the values it is handed are dicts rather than arrays).
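
For the toy example above, this is roughly what happens (just a sketch, not my real pipeline): each dict value gets wrapped in a 0-d object array that is pickled inside the .npz, and the file can only be read back with allow_pickle=True:

np.savez('filename.npz', **properties)    # each value is a dict, so savez stores pickled 0-d object arrays

loaded = np.load('filename.npz', allow_pickle=True)   # raises ValueError without allow_pickle=True
obj1 = loaded['obj1'].item()                          # .item() unwraps the 0-d object array back into a dict
obj1['x']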

Is there a more efficient data structure that is widely applicable to my use case? Is it worth switching from my nested dict to pandas DataFrames, numpy ndarrays or record arrays, or a list of some kind of Table-like objects? It would be nice to be able to save/load the file in a binary format that preserves the mapping from object names to their dict/array/table/dataframe of properties, and of course the names of each property's time series array.


Solution

  • Let's look at your obj2 value, a dict:

    In [307]: dd={'time':np.arange(15),'x':np.random.randn(15),'vx':np.random.randn(15)}
    
    In [308]: dd
    Out[308]: 
    {'time': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
     'x': array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
            -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
            -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519]),
     'vx': array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
             1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
            -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012])}
    

    It's easy to make a DataFrame from that:

    In [309]: df = pd.DataFrame(dd)
    
    In [310]: df
    Out[310]: 
        time         x        vx
    0      0 -0.481979 -1.602281
    1      1  0.155978 -1.491630
    2      2  0.441134 -1.170610
    3      3  1.380628 -0.092675
    4      4 -1.212734 -0.941331
    5      5 -1.271200  1.863910
    6      6  1.530727  1.006901
    7      7  1.979926 -0.161684
    8      8  0.136479  1.518014
    9      9 -1.370568 -1.164364
    10    10 -2.064708 -0.202543
    11    11  0.923150 -1.602801
    12    12  0.308854 -1.917494
    13    13  0.648600  0.253666
    14    14  1.302735 -1.619930
    
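    The same construction scales up to your full properties dict. A minimal sketch (the pd.concat-with-keys layout is just one option, my own suggestion rather than anything you must use):

    import numpy as np
    import pandas as pd

    properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
                  'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

    # one DataFrame per object, stacked with the object name as the outer index level
    frames = {name: pd.DataFrame(props) for name, props in properties.items()}
    big = pd.concat(frames, names=['object', 'row'])

    big.loc['obj2']                      # recovers the per-object frame
    # big.to_pickle('properties.pkl')    # one binary format that keeps all the names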

    We could also make a structured array from that frame. I could also make the array directly from your dict, defining the same compound dtype (a sketch of that route follows the output below). But since I already have the frame, I'll go this route. The distinction between a structured array and a recarray is minor.

    In [312]: arr = df.to_records()
    
    In [313]: arr
    Out[313]: 
    rec.array([( 0,  0, -0.48197915, -1.60228105),
               ( 1,  1,  0.15597792, -1.49163002),
               ( 2,  2,  0.44113401, -1.17061046),
               ( 3,  3,  1.38062753, -0.09267467),
               ( 4,  4, -1.21273378, -0.94133092),
               ( 5,  5, -1.27120008,  1.86391024),
               ( 6,  6,  1.53072667,  1.006901  ),
               ( 7,  7,  1.9799255 , -0.16168439),
               ( 8,  8,  0.13647925,  1.5180135 ),
               ( 9,  9, -1.37056793, -1.16436363),
               (10, 10, -2.06470784, -0.20254291),
               (11, 11,  0.92314969, -1.60280149),
               (12, 12,  0.30885371, -1.91749387),
               (13, 13,  0.64860014,  0.25366602),
               (14, 14,  1.30273519, -1.61993012)],
              dtype=[('index', '<i8'), ('time', '<i4'), ('x', '<f8'), ('vx', '<f8')])
    
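    For reference, here's a sketch of the "directly from your dict" route mentioned above, building the compound dtype from the dd keys (this skips the extra 'index' field that df.to_records() adds):

    import numpy as np

    dd = {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}

    # one field per dict key, dtype taken from each array
    dt = np.dtype([(name, arr.dtype) for name, arr in dd.items()])
    arr2 = np.zeros(15, dtype=dt)
    for name, val in dd.items():
        arr2[name] = val                 # fill each field from the corresponding array

    arr2['x']                            # field access mirrors the original dict access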

    Now let's compare the pickle strings:

    In [314]: import pickle
    
    In [315]: len(pickle.dumps(dd))
    Out[315]: 561
    
    In [316]: len(pickle.dumps(df))      # df.to_pickle makes a 1079 byte file
    Out[316]: 1052
    
    In [317]: len(pickle.dumps(arr))     # arr.nbytes is 420
    Out[317]: 738                        # np.save writes a 612 byte file
    

    And another encoding - a list:

    In [318]: alist = list(dd.items())
    In [319]: alist
    Out[319]: 
    [('time', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])),
     ('x',
      array([-0.48197915,  0.15597792,  0.44113401,  1.38062753, -1.21273378,
             -1.27120008,  1.53072667,  1.9799255 ,  0.13647925, -1.37056793,
             -2.06470784,  0.92314969,  0.30885371,  0.64860014,  1.30273519])),
     ('vx',
      array([-1.60228105, -1.49163002, -1.17061046, -0.09267467, -0.94133092,
              1.86391024,  1.006901  , -0.16168439,  1.5180135 , -1.16436363,
             -0.20254291, -1.60280149, -1.91749387,  0.25366602, -1.61993012]))]
    In [320]: len(pickle.dumps(alist))
    Out[320]: 567
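
    Finally, one way to apply all this to your full properties dict, sketched rather than benchmarked: one structured array per object, saved with np.savez_compressed (plain savez works too). Every value handed to savez is then a real ndarray, the object-name/property-name mapping survives in the .npz, and loading needs no pickle:

    import numpy as np
    import pandas as pd

    properties = {'obj1': {'time': np.arange(10), 'x': np.random.randn(10), 'vx': np.random.randn(10)},
                  'obj2': {'time': np.arange(15), 'x': np.random.randn(15), 'vx': np.random.randn(15)}}

    # one structured array per object (index=False drops the extra index field)
    recs = {name: pd.DataFrame(props).to_records(index=False) for name, props in properties.items()}
    np.savez_compressed('properties.npz', **recs)

    loaded = np.load('properties.npz')   # plain numeric fields, so no allow_pickle needed
    loaded['obj2']['x']                  # object name, then property name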