python numpy valueerror dtype

Does numpy handle arrays with dtype wrong?


The following code snippet

import numpy as np

n_folds = 3
fold_quantities = np.array([(0, 0, 0)])
for i in np.arange(n_folds) + 1:
    fold_quantities = np.concatenate(
        (fold_quantities, [(i, 0, 0)])
    )
print(fold_quantities)

gives me

array([[ 0,  0,  0],
       [ 1,  0,  0],
       [ 2,  0,  0],
       [ 3,  0,  0]])

When changing nothing but specifying the dtype of the ndarray

import numpy as np

n_folds = 3
fold_quantities = np.array([(0, 0, 0)],
    dtype=[('index', int), ('#datapoints', 'int'), ('#pos_labels', 'int')])
for i in np.arange(n_folds) + 1:
    fold_quantities = np.concatenate(
        (fold_quantities, [(i, 0, 0)])
    )
print(fold_quantities)

it throws an error

ValueError   Traceback (most recent call last)
<ipython-input-174-649369eed10a> in <module>
      5     fold_quantities = np.concatenate(
      6         (fold_quantities,
----> 7          [(i, 0, 0)])
      8     )
      9 print(fold_quantities)

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)

This message seems to make no sense. The array dimensions did not change.

How should that be handled? I would like to have the dtype specified since I want to sort the array according to single columns with sorted(key=).


Solution

  • Your first array should be made with list append or a list comprehension. Repeated concatenate is slower, because each call copies the whole array.

    In [97]: np.array([[i,0,0] for i in range(4)])
    Out[97]: 
    array([[0, 0, 0],
           [1, 0, 0],
           [2, 0, 0],
           [3, 0, 0]])
    

    With the compound dtype (dt holds the list of fields from the question):

    In [100]: np.array([(i,0,0) for i in range(4)], dtype=dt)                                      
    Out[100]: 
    array([(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],
          dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')])
    

    Note the use of dt and of tuples instead of lists. Data for a structured array has to be given as a list of tuples (matching the way it is displayed).

    With the change in dtype, the shape changes:

    In [101]: _100.shape                                                                           
    Out[101]: (4,)
    In [102]: _97.shape                                                                            
    Out[102]: (4, 3)
    

    To add an array to a structured array, it has to have a compatible dtype and shape:

    In [104]: np.array([(4,0,0)],dt)                                                               
    Out[104]: 
    array([(4, 0, 0)],
          dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')])
    

    This is a (1,) array with dt dtype.

    In [105]: np.concatenate([_100, _104])                                                         
    Out[105]: 
    array([(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)],
          dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')])
    In [106]: _.shape                                                                              
    Out[106]: (5,)
    

    Another way of making the structured array - start with a list of arrays with the correct dtype:

    In [107]: alist = [np.array((i,0,0),dt) for i in range(4)]                                     
    In [108]: alist                                                                                
    Out[108]: 
    [array((0, 0, 0),
           dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')]),
     array((1, 0, 0),
           dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')]),
     array((2, 0, 0),
           dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')]),
     array((3, 0, 0),
           dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')])]
    

    I use stack to join them since all 4 are 0d (scalar) arrays, which concatenate cannot join.

    In [109]: np.stack(alist)                                                                      
    Out[109]: 
    array([(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)],
          dtype=[('index', '<i8'), ('#datapoints', '<i8'), ('#pos_labels', '<i8')])
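
Putting the pieces together, here is a minimal sketch of how the loop from the question could be rewritten, and of sorting by a single field afterwards. The field names come from the question; the `#datapoints` values are made up for illustration, and `np.sort(..., order=...)` is used in place of the `sorted(key=)` idea from the question.

```python
import numpy as np

# Field names taken from the question; dt mirrors the dtype used above.
dt = [('index', int), ('#datapoints', int), ('#pos_labels', int)]

# Why the original concatenate failed: the structured array is 1-D
# (one record), while a bare list of tuples becomes a 2-D (1, 3) array.
structured = np.array([(0, 0, 0)], dtype=dt)
plain = np.array([(1, 0, 0)])
print(structured.ndim, plain.ndim)

# Collect all records as tuples first, then convert once.
n_folds = 3
rows = [(i, 0, 0) for i in range(n_folds + 1)]
fold_quantities = np.array(rows, dtype=dt)

# Illustrative values only (not from the question), so sorting is visible.
fold_quantities['#datapoints'] = [5, 2, 9, 1]

# Sort the records by one field.
by_datapoints = np.sort(fold_quantities, order='#datapoints')
print(by_datapoints['index'])
```

The second print should list the `index` values reordered by ascending `#datapoints`. If a loop with concatenate is really needed, each appended piece would have to be built as `np.array([(i, 0, 0)], dt)` so that dtype and number of dimensions match.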