python, numpy, numpy-ndarray, pytables

Combine two NumPy arrays into one structured array for appending to a PyTables table


I have two unstructured NumPy arrays a and b with shapes (N,) and (N, 256, 2) respectively, both of dtype np.float64. I wish to combine these into a single structured array with shape (N,) and dtype [('field1', np.float64), ('field2', np.float64, (256, 2))].

The documentation on this is surprisingly lacking. I've found methods like np.lib.recfunctions.merge_arrays but have not been able to find the precise combination of features required to do this.


For the sake of avoiding the XY problem, I'll state my wider aims.

I have a PyTables table with layout {"field1": tables.FloatCol(), "field2": tables.FloatCol(shape = (256, 2))}. The two NumPy arrays represent N new rows to be appended to each of these fields. N is large, so I wish to do this with a single efficient table.append(rows) call, rather than the slow process of looping through table.row['field'] = ....

The table.append documentation says:

The rows argument may be any object which can be converted to a structured array compliant with the table structure (otherwise, a ValueError is raised). This includes NumPy structured arrays, lists of tuples or array records, and a string or Python buffer.

Converting my arrays to an appropriate structured array seems to be what I should be doing here. I'm looking for speed, and I anticipate the other options being slower.


Solution

  • Define the dtype, and create an empty/zeros array:

    In [163]: dt = np.dtype([('field1', np.float64), ('field2', np.float64, (4, 2))])
    In [164]: arr = np.zeros(3, dt)     # float display is prettier                                                          
    In [165]: arr                                                                            
    Out[165]: 
    array([(0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
           (0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]]),
           (0., [[0., 0.], [0., 0.], [0., 0.], [0., 0.]])],
          dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
    

    Assign values field by field:

    In [166]: arr['field1'] = np.arange(3)                                                   
    In [167]: arr['field2'].shape                                                            
    Out[167]: (3, 4, 2)
    In [168]: arr['field2'] = np.arange(24).reshape(3,4,2)                                   
    In [169]: arr                                                                            
    Out[169]: 
    array([(0., [[ 0.,  1.], [ 2.,  3.], [ 4.,  5.], [ 6.,  7.]]),
           (1., [[ 8.,  9.], [10., 11.], [12., 13.], [14., 15.]]),
           (2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
          dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
    

    np.rec does have a function that works similarly:

    In [174]: np.rec.fromarrays([np.arange(3.), np.arange(24).reshape(3,4,2)], dtype=dt)     
    Out[174]: 
    rec.array([(0., [[ 0.,  1.], [ 2.,  3.], [ 4.,  5.], [ 6.,  7.]]),
               (1., [[ 8.,  9.], [10., 11.], [12., 13.], [14., 15.]]),
               (2., [[16., 17.], [18., 19.], [20., 21.], [22., 23.]])],
              dtype=[('field1', '<f8'), ('field2', '<f8', (4, 2))])
    

    The result holds the same data, but it is a recarray, so fields can also be accessed as attributes. Under the covers it performs the same by-field assignment.

    numpy.lib.recfunctions is another collection of structured array functions. These too mostly follow the by-field assignment approach.
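
Applied to the shapes in the question, the by-field approach might look like the sketch below. The helper name combine_to_structured is hypothetical (it is not a NumPy or PyTables function), and the field names and float64 dtype are taken from the table layout described in the question. N is kept small here only so the example runs quickly.

```python
import numpy as np

def combine_to_structured(a, b):
    """Hypothetical helper: pack a (N,) array and a (N, ...) array
    into a single (N,) structured array with fields field1 and field2."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.ndim != 1 or b.shape[0] != a.shape[0]:
        raise ValueError("a must be 1-D and b must match its length along axis 0")

    # The subarray shape of field2 is whatever follows the row axis of b.
    dt = np.dtype([("field1", np.float64), ("field2", np.float64, b.shape[1:])])

    # Allocate once, then copy each source array into its field.
    rows = np.empty(a.shape[0], dtype=dt)
    rows["field1"] = a
    rows["field2"] = b
    return rows

N = 5  # small stand-in for the large N in the question
a = np.arange(N, dtype=np.float64)
b = np.arange(N * 256 * 2, dtype=np.float64).reshape(N, 256, 2)

rows = combine_to_structured(a, b)
print(rows.shape)            # shape of the structured array
print(rows.dtype)            # field names and per-field shapes
print(rows["field2"].shape)  # shape of the stacked field2 data
```

An array built this way has a dtype matching the table layout in the question, so it should be a suitable rows argument for a single table.append(rows) call. That step is not exercised here.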