
Concatenating arrays of different sizes


I am trying to run a quadtree algorithm on a NumPy array of points created by the make_blobs function from sklearn. The goal is a recursive-partition KMeans in which the centroids are found in each quadtree partition of the space. Here is my partitioning function:

def partition(self, data):
    if data.size != 0:
        minX = np.min(data[:,0])
        maxX = np.max(data[:,0])
        minY = np.min(data[:,1])
        maxY = np.max(data[:,1])
        middleX = (maxX + minX)/2
        middleY = (maxY + minY)/2
        parts1 = np.array([i for i in data if i[0] < middleX and i[1] > middleY])
        parts2 = np.array([i for i in data if i[0] > middleX and i[1] > middleY])
        parts3 = np.array([i for i in data if i[0] < middleX and i[1] < middleY])
        parts4 = np.array([i for i in data if i[0] > middleX and i[1] < middleY])
        parts = np.array([parts1, parts2, parts3, parts4])
        return parts
    else:
        return np.array([[], [], [], []])            

My dataset created by the make_blobs function has the following structure:

[[ 9.26360832 -9.18849755] [ 7.3971609 9.92622627] [ 7.29022892 -10.39359926] ... [ 8.66667995 -11.99184453] [ 5.80627027 10.53947197] [ 6.14214488 -0.73405016]]

The example output of this function could be:

[array([[3.95348068, 4.74190848]]) array([[4.47174131, 4.67345222], [4.73856072, 4.68464296]]) array([], dtype=float64) array([[4.48952751, 4.38898038], [4.47734611, 4.34300488]])]

This has shape (4,). However, the output can also have shape (4,1,2), like the following:

[[[-7.17718091 -4.92636967]]

[[-6.66796907 -4.94025585]]

[[-7.03501112 -5.17783394]]

[[-6.45835039 -5.17271443]]]

Then I am trying to concatenate the partitions, so that I get one big array of arrays with the partitions. This is the line responsible for concatenation:

part_data = np.hstack([self.partition(d) for d in part_data if np.shape(self.partition(d)) != (4,0)])

The problem occurs when the partitions are all empty or all the same size, so that the shape is (4,0), (4,1,2) or (4,2,2). Arrays with these shapes cannot be concatenated with the (4,) ones. The error is the following:

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 10 has 3 dimension(s)

Would it be possible to ignore these shapes or somehow reshape them to (4,)? Maybe there is some trick to append not as array but as an object? I would be grateful for any response. This is the whole code for this example:

import numpy as np
from sklearn.datasets import make_blobs

def generateDataset(k, dimensions, n_samples):
    X, y_true = make_blobs(n_samples = n_samples, centers = k, n_features= dimensions, cluster_std = 1.1)
    return X, y_true
X, y_true = generateDataset(3,2,10000)

def partition(data):
    if data.size != 0:
        minX = np.min(data[:,0])
        maxX = np.max(data[:,0])
        minY = np.min(data[:,1])
        maxY = np.max(data[:,1])
        middleX = (maxX + minX)/2
        middleY = (maxY + minY)/2
        parts1 = np.array([i for i in data if i[0] < middleX and i[1] > middleY])
        parts2 = np.array([i for i in data if i[0] > middleX and i[1] > middleY])
        parts3 = np.array([i for i in data if i[0] < middleX and i[1] < middleY])
        parts4 = np.array([i for i in data if i[0] > middleX and i[1] < middleY])
        parts = np.array([parts1, parts2, parts3, parts4])
        return parts
    else:
        return np.array([[], [], [], []])

part_data = partition(X)
for i in range(6):
    if i >= 1:
        part_data = np.hstack([partition(d) for d in part_data if np.shape(partition(d)) != (4,0)])

Solution

  • When I first read the question, I thought you were trying to hstack arrays with shapes (4,0), (4,1,2) or (4,2,2). But from the comments it appears that there are shape (4,) arrays as well.

    The 4 part comes from joining 4 elements

    parts = np.array([parts1, parts2, parts3, parts4])
    

    each of those being the result of an expression like:

    parts1 = np.array([i for i in data if i[0] < middleX and i[1] > middleY])
    

    You don't give a sample of data (don't expect us to recreate it from your code!), and not even an example of those parts.

    Here I construct a sample 2d array, guessing at what will work:

    In [18]: data = np.array([[1,3],[2,4],[3,1]])
    In [19]: [i for i in data]           # iterate on the rows
    Out[19]: [array([1, 3]), array([2, 4]), array([3, 1])]
    

    various 'range' tests:

    In [20]: [i for i in data if i[0]<2 and i[1]>2]
    Out[20]: [array([1, 3])]
    In [21]: np.array(_)
    Out[21]: array([[1, 3]])
    In [22]: _.shape
    Out[22]: (1, 2)
    In [23]: [i for i in data if i[0]<2 and i[1]>3]
    Out[23]: []
    In [24]: [i for i in data if i[0]<2 and i[1]>1]
    Out[24]: [array([1, 3])]
    In [25]: [i for i in data if i[0]<1 and i[1]>1]
    Out[25]: []
    In [26]: [i for i in data if i[0]<3 and i[1]>1]
    Out[26]: [array([1, 3]), array([2, 4])]
    In [27]: np.array([i for i in data if i[0]<3 and i[1]>1])
    Out[27]: 
    array([[1, 3],
           [2, 4]])
    In [29]: np.array([i for i in data if i[0]<3 and i[1]>3])
    Out[29]: array([[2, 4]])
    

    So I can get a parts array that is (0,), (1,2), or (2,2) (or more for the first dimension).

    Join 4 of those into an array and you get a (4,1,2) etc. But wait, each of those 4 tests could give different size arrays, in which case np.array(parts....) will produce an object dtype array with shape (4,).

    Is that what's going on? You have a mix of mostly (4,) object dtype arrays along with some (4,0) and (4,n,2) shaped numeric dtype ones?
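The three outcomes described above can be reproduced in a small script. This is only a sketch: `try_stack` is a hypothetical helper name, and the ragged case depends on the NumPy version. Releases that still allow ragged input return a (4,) object array with a warning, while newer ones (1.24 onward, as far as I know) raise a ValueError instead, so the helper reports either result.

```python
import warnings
import numpy as np

def try_stack(parts):
    """Call np.array(parts); return (shape, dtype), or ("error", message) if NumPy refuses."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence the ragged-array warning on older NumPy
        try:
            arr = np.array(parts)
        except ValueError as exc:
            return "error", str(exc)
    return arr.shape, arr.dtype

one = np.array([[2, 4]])           # shape (1, 2)
two = np.array([[1, 3], [2, 4]])   # shape (2, 2)
empty = np.array([])               # shape (0,)

uniform = try_stack([one, one, one, one])            # all parts share shape (1, 2)
all_empty = try_stack([empty, empty, empty, empty])  # all parts are empty
ragged = try_stack([one, two, empty, two])           # parts differ in shape

print(uniform[0], all_empty[0], ragged[0])
```

The uniform and all-empty cases give numeric arrays with 3 and 2 dimensions respectively, which is why they cannot be stacked next to the 1-dimensional object arrays.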

    More than full code, or minimal example, we should have demanded that you show the list that you are trying to hstack:

    [partition(d) for d in part_data if np.shape(partition(d)) != (4,0)]
    

    Let's try to make the partition array from 4 of those sample results:

    In [46]: [Out[20],Out[27],Out[25],Out[29]]
    Out[46]: 
    [[array([1, 3])],
     array([[1, 3],
            [2, 4]]),
     [],
     array([[2, 4]])]
    In [47]: x1=np.array([Out[20],Out[27],Out[25],Out[29]])
    <ipython-input-47-b04a5e3fb51c>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
      x1=np.array([Out[20],Out[27],Out[25],Out[29]])
    In [48]: x1
    Out[48]: 
    array([list([array([1, 3])]), array([[1, 3],
                                         [2, 4]]), list([]), array([[2, 4]])],
          dtype=object)
    

    Did you get that ragged array warning? Note that the resulting array is (4,) object dtype.

    If instead all parts are the same shape, such as (1,2):

    In [49]: x2=np.array([Out[29],Out[29],Out[29],Out[29]])
    In [50]: x2.shape
    Out[50]: (4, 1, 2)
    In [51]: x2
    Out[51]: 
    array([[[2, 4]],
    
           [[2, 4]],
    
           [[2, 4]],
    
           [[2, 4]]])
    

    or a (4,0)

    In [54]: x3=np.array([Out[23],Out[23],Out[23],Out[23]])
    In [55]: x3
    Out[55]: array([], shape=(4, 0), dtype=float64)
    
    In [56]: x4=np.array([Out[27],Out[27],Out[27],Out[27]])
    In [57]: x4.shape
    Out[57]: (4, 2, 2)
    

    Even without the (4,0) we get a dimension mismatch:

    In [59]: np.hstack((x1,x2,x4))
    Traceback (most recent call last):
      Input In [59] in <cell line: 1>
        np.hstack((x1,x2,x4))
      File <__array_function__ internals>:180 in hstack
      File /usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py:343 in hstack
        return _nx.concatenate(arrs, 0)
      File <__array_function__ internals>:180 in concatenate
    ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 3 dimension(s)
    

    We can join several of the (4,) into a new object dtype array:

    In [61]: np.hstack((x1,x1,x1)).shape
    Out[61]: (12,)
    

    The key issue is that np.array((part1,part2,...)) is not a reliable way of making a (4,) object dtype array. Sometimes it makes a (4,) with a warning, sometimes it makes a (4,0) or (4,n,2). By glossing over the ragged warning you confused both yourself and us!

    If we define a helper function, we can reliably make an object dtype array, even when the inputs are all identical in shape:

    In [62]: def foo(*args):
        ...:     res = np.empty(len(args),object)
        ...:     res[:] = args
        ...:     return res
        ...: 
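For use outside an interactive session, the same idea can be written as a standalone function. This is a sketch of a variation, not the exact helper above: `as_object_array` is a hypothetical name, it takes one list rather than `*args`, and it fills the slots one at a time so that NumPy never gets a chance to interpret the whole list as a multidimensional array.

```python
import numpy as np

def as_object_array(parts):
    """Pack each item of `parts` into its own slot of a 1-D object array."""
    out = np.empty(len(parts), dtype=object)
    for i, part in enumerate(parts):
        out[i] = part  # stores a reference to the item, whatever its shape
    return out

# Four parts with identical shape (1, 2); np.array would give (4, 1, 2) here.
same = as_object_array([np.array([[2, 4]])] * 4)

# Four parts with differing shapes.
mixed = as_object_array([
    np.array([[1, 3]]),
    np.array([[1, 3], [2, 4]]),
    np.array([]),
    np.array([[2, 4]]),
])

joined = np.hstack([same, mixed])
print(same.shape, mixed.shape, joined.shape)
```

Because both inputs are 1-dimensional object arrays, np.hstack concatenates them without a dimension mismatch.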
    

    Using that to recreate the 4 parts:

    In [63]: x1 = foo([Out[29],Out[29],Out[29],Out[29]])
    In [64]: x1.shape,x1.dtype
    Out[64]: ((1,), dtype('O'))
    In [65]: x1 = foo(Out[29],Out[29],Out[29],Out[29])
    In [66]: x1.shape, x1.dtype
    Out[66]: ((4,), dtype('O'))
    In [67]: x2=foo(Out[29],Out[29],Out[29],Out[29])
    In [68]: x2.shape, x2.dtype
    Out[68]: ((4,), dtype('O'))
    In [69]: x3=foo(Out[23],Out[23],Out[23],Out[23])
    In [70]: x3.shape, x3.dtype
    Out[70]: ((4,), dtype('O'))
    In [71]: x4=foo(Out[27],Out[27],Out[27],Out[27])
    In [72]: x4.shape, x4.dtype
    Out[72]: ((4,), dtype('O'))
    In [73]: arr = np.hstack((x1,x2,x3,x4))
    In [74]: arr.shape
    Out[74]: (16,)
    

    The resulting array is a bit messy, but worth looking at. Is this really what you want, and something you will be able to use?

    In [75]: arr
    Out[75]: 
    array([array([[2, 4]]), array([[2, 4]]), array([[2, 4]]), array([[2, 4]]),
           array([[2, 4]]), array([[2, 4]]), array([[2, 4]]), array([[2, 4]]),
           list([]), list([]), list([]), list([]), array([[1, 3],
                                                          [2, 4]]),
           array([[1, 3],
                  [2, 4]]), array([[1, 3],
                                   [2, 4]]), array([[1, 3],
                                                    [2, 4]])], dtype=object)
    

    The list equivalent might be just as useful:

    In [76]: arr.tolist()
    Out[76]: 
    [array([[2, 4]]),
     array([[2, 4]]),
     array([[2, 4]]),
     array([[2, 4]]),
     array([[2, 4]]),
     array([[2, 4]]),
     array([[2, 4]]),
     array([[2, 4]]),
     [],
     [],
     [],
     [],
     array([[1, 3],
            [2, 4]]),
     array([[1, 3],
            [2, 4]]),
     array([[1, 3],
            [2, 4]]),
     array([[1, 3],
            [2, 4]])]
    

    The x3 case where inputs are all empty lists may need some refinement:

    In [80]: x3
    Out[80]: array([list([]), list([]), list([]), list([])], dtype=object)
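One possible refinement, offered as an assumption rather than something taken from the question: coerce every part to a float array of shape (n, 2) before storing it, so that an empty part is a (0, 2) array instead of a plain list and still supports `.size` and column indexing. `as_points` is a hypothetical helper name.

```python
import numpy as np

def as_points(part):
    """Coerce a list or array of 2-D points to a float array of shape (n, 2)."""
    return np.asarray(part, dtype=float).reshape(-1, 2)

empty = as_points([])                   # no points at all
single = as_points([np.array([1, 3])])  # one point, as produced by a list comprehension

print(empty.shape, single.shape)
```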
    

    edit

    The sample array that you added is:

    [array([[3.95348068, 4.74190848]]) 
     array([[4.47174131, 4.67345222], 
            [4.73856072, 4.68464296]]) 
     array([], dtype=float64) 
     array([[4.48952751, 4.38898038], 
            [4.47734611, 4.34300488]])]
    

    That is (4,) (not (4,0) or (4,1)), and object dtype. That's much like a list, containing references to 4 arrays. Those arrays differ in shape: (1,2), (2,2), (0,), (2,2). Because of the differing shapes, np.array can only make an object dtype array (with the ragged array warning).

    The example after it in the question is (4,1,2), made by applying np.array to a list of 4 arrays all with shape (1,2). np.array preferentially makes a multidimensional numeric array. Making a (4,) object array from that list requires special action, as I show in the foo function.
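Since a (4,) object array behaves much like a list, one alternative is to keep the partitions in a plain Python list and skip np.array and np.hstack altogether. The sketch below is my own assumption about what would serve the quadtree use case, not code from the question: `split_quadrants` and `refine` are hypothetical names, boolean masks replace the list comprehensions, empty quadrants are dropped, and points lying exactly on a midline are kept (the original strict `<` and `>` tests discard them).

```python
import numpy as np

def split_quadrants(points):
    """Split an (n, 2) array into its non-empty quadrants around the bounding-box centre."""
    if len(points) == 0:
        return []
    mid_x = (points[:, 0].min() + points[:, 0].max()) / 2
    mid_y = (points[:, 1].min() + points[:, 1].max()) / 2
    left = points[:, 0] < mid_x
    upper = points[:, 1] > mid_y
    quadrants = [
        points[left & upper],
        points[~left & upper],
        points[left & ~upper],
        points[~left & ~upper],
    ]
    return [q for q in quadrants if len(q) > 0]

def refine(points, depth):
    """Apply split_quadrants `depth` times, returning a flat list of cells."""
    cells = [points]
    for _ in range(depth):
        cells = [q for cell in cells for q in split_quadrants(cell)]
    return cells

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))  # stand-in for the make_blobs output
cells = refine(pts, 3)
print(len(cells), sum(len(c) for c in cells))
```

Every cell stays a regular (n, 2) numeric array, so no shape mismatch can arise when the lists are flattened.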