
How to get unique rows and their occurrences for 2D array?


I have a 2D array, and it has some duplicate columns. I would like to be able to see which unique columns there are, and where the duplicates are.

My own array is too large to put here, but here is an example:

a = np.array([[1., 0., 0., 0., 0.], [2., 0., 4., 3., 0.]])

This has the unique column vectors [1.,2.], [0.,0.], [0.,4.] and [0.,3.]. There is one duplicate: [0.,0.] appears twice.

I found a way to get the unique vectors and their indices, but it is not clear to me how to get the occurrences of the duplicates as well. I have tried several naive approaches (with np.where and list comprehensions), but they are all very slow. Surely there has to be a numpythonic way?

In MATLAB it's just the unique function, but np.unique flattens arrays.


Solution

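  • Here's a vectorized approach that returns a list of index arrays, one per unique column. It encodes each column as a single integer ID, so it assumes the values are non-negative integers -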

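    import numpy as np

    # Encode each column as one integer ID (values are cast to int and must be non-negative)
    ids = np.ravel_multi_index(a.astype(int), a.max(1).astype(int) + 1)
    sidx = ids.argsort()
    sorted_ids = ids[sidx]
    # Split the sorted column indices wherever the ID changes
    out = np.split(sidx, np.nonzero(sorted_ids[1:] > sorted_ids[:-1])[0] + 1)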
    

    Sample run -

    In [62]: a
    Out[62]: 
    array([[ 1.,  0.,  0.,  0.,  0.],
           [ 2.,  0.,  4.,  3.,  0.]])
    
    In [63]: out
    Out[63]: [array([1, 4]), array([3]), array([2]), array([0])]
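
If the values are not non-negative integers, the ID encoding above does not apply. A possible alternative, assuming a NumPy version where np.unique accepts the axis argument (1.13 or later), is to let np.unique compare whole columns and then group the column indices by the inverse mapping. This is only a sketch of that idea, not the method from the answer above; the names uniq, inverse, counts, order and groups are made up for illustration.

```python
import numpy as np

a = np.array([[1., 0., 0., 0., 0.],
              [2., 0., 4., 3., 0.]])

# axis=1 makes np.unique treat each column as one item instead of flattening
uniq, inverse, counts = np.unique(
    a, axis=1, return_inverse=True, return_counts=True)

# Flatten defensively: the shape of the inverse array has varied between NumPy releases
inverse = np.ravel(inverse)

# Stable sort keeps the original column order inside each group
order = np.argsort(inverse, kind="stable")

# Cut the sorted column indices at the group boundaries given by the counts
groups = np.split(order, np.cumsum(counts)[:-1])

print(uniq)    # unique columns, sorted lexicographically
print(counts)  # how many times each unique column occurs
print(groups)  # column indices in a that belong to each unique column
```

Here groups[i] holds the positions in a of the column uniq[:, i], so any group with more than one index marks a duplicated column.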