python numpy multidimensional-array unique tensor

Numpy way to check if any two samples in a tensor are identical


Some examples:

import numpy as np
tensor_same = np.array([[1]*10 + [2] * 10 + [1] * 10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2] * 10 + [1] * 9 + [2]]).reshape((-1, 10, 1))

The first tensor has two samples that are the same. In the second, all samples are different.
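
To make the premise concrete, here is a small check of the two tensors. It is only a sketch; the names `same_pair` and `diff_pair` are illustrative and not part of the original question.

```python
import numpy as np

tensor_same = np.array([[1]*10 + [2]*10 + [1]*10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2]*10 + [1]*9 + [2]]).reshape((-1, 10, 1))

# Each tensor holds 3 samples, each of shape (10, 1).
print(tensor_same.shape, tensor_diff.shape)

# Compare sample 0 with sample 2 in each tensor.
# In tensor_diff the two samples differ only in their last entry.
same_pair = np.array_equal(tensor_same[0], tensor_same[2])
diff_pair = np.array_equal(tensor_diff[0], tensor_diff[2])
print(same_pair, diff_pair)
```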

What's the fastest way of checking this for very large tensors?


Solution

  • We can use np.unique along the first axis to get the unique samples. If their count equals the number of samples in the original input, all samples are different; otherwise there is at least one duplicate, like so -

    In [25]: len(np.unique(tensor_same,axis=0)) != len(tensor_same)
    Out[25]: True
    
    In [26]: len(np.unique(tensor_diff,axis=0)) != len(tensor_diff)
    Out[26]: False
    

    Another way would be to use the counts returned by np.unique -

    In [42]: (np.unique(tensor_same,axis=0, return_counts=1)[1]>1).any()
    Out[42]: True
    
    In [43]: (np.unique(tensor_diff,axis=0, return_counts=1)[1]>1).any()
    Out[43]: False
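
The two np.unique checks can be wrapped in a reusable helper. The following is a minimal sketch; the function name `has_duplicate_samples` is hypothetical, and it assumes samples are stacked along the first axis.

```python
import numpy as np

def has_duplicate_samples(a):
    """Return True if any two samples along axis 0 are identical."""
    a = np.asarray(a)
    if len(a) < 2:
        # Fewer than two samples cannot contain a duplicate pair.
        return False
    # Flatten each sample to one row so whole samples are compared.
    rows = a.reshape(len(a), -1)
    counts = np.unique(rows, axis=0, return_counts=True)[1]
    return bool((counts > 1).any())

tensor_same = np.array([[1]*10 + [2]*10 + [1]*10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2]*10 + [1]*9 + [2]]).reshape((-1, 10, 1))
print(has_duplicate_samples(tensor_same), has_duplicate_samples(tensor_diff))
```

Flattening first keeps the helper independent of how many trailing dimensions each sample has.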
    

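    Another way would be to sort along the first axis, perform consecutive element differentiation and then look for all zeros along the second axis and finally ANY match. Note that np.sort with axis=0 sorts every position independently rather than reordering whole samples, and .all(1) reduces only the second axis, so on general inputs this check can report a duplicate where none exists; it gives the right answer for the two tensors here -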

    In [29]: (np.diff(np.sort(tensor_same,axis=0),axis=0)==0).all(1).any()
    Out[29]: True
    
    In [30]: (np.diff(np.sort(tensor_diff,axis=0),axis=0)==0).all(1).any()
    Out[30]: False
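
To illustrate why sorting each position independently can mislead, the sketch below builds three distinct samples for which the column-wise check reports a duplicate. It then shows one possible alternative, swapped in here and not taken from the original answer: ordering whole rows with np.lexsort before comparing neighbours. The variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2],
              [2, 1],
              [1, 1]])   # three distinct samples

# Column-wise sort mixes values that belong to different samples.
by_column = (np.diff(np.sort(a, axis=0), axis=0) == 0).all(1).any()

# Lexicographic sort moves whole rows, so each sample stays intact.
rows = a.reshape(len(a), -1)
rows_sorted = rows[np.lexsort(rows.T)]
by_row = (rows_sorted[1:] == rows_sorted[:-1]).all(1).any()

print(by_column, by_row)
```

Here `by_column` flags a duplicate although all three samples differ, while `by_row` does not.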
    

    Another way would be to use views so that each 2D block is treated as a single element, then apply the same idea of sorting and looking for identical consecutive elements, like so -

    # https://stackoverflow.com/a/44999009/ @Divakar
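    def view1D(a): # a is a 2D array
        a = np.ascontiguousarray(a)
        # View each row as one opaque (void) element spanning the row's bytes
        void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
        return a.view(void_dt).ravel()
    
    def is_any_identical(a):
        # Flatten each sample to one row, then collapse each row to one element
        a1D = view1D(a.reshape(a.shape[0],-1))
        # After sorting, identical samples sit next to each other
        a1Ds = np.sort(a1D)
        return (a1Ds[:-1] == a1Ds[1:]).any()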
    

    Sample run -

    In [90]: np.random.seed(0)
        ...: a = np.random.randint(11,99,(6,4,3))
    
    In [91]: is_any_identical(a)
    Out[91]: False
    
    In [92]: a[2] = a[1] # force one identical element
    
    In [93]: is_any_identical(a)
    Out[93]: True
    

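    For positive ints, alternatively we can use np.einsum to get the same dimensionality-reduction and end up with one number for each 2D block. Note that this weighted sum is not guaranteed to be unique per block (two different blocks can produce the same value), so a match found this way should be confirmed with one of the exact methods above. With that caveat, we would have the a1D equivalent in is_any_identical() like so -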

    a1D = np.einsum('ijk,jk->i',a,a.max(0)+1)
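
The sketch below shows a case where the weighted sum collides for two different samples. It also shows one possible collision-free encoding, which is an alternative technique and not part of the original answer: a mixed-radix code built with np.ravel_multi_index. It assumes non-negative integers and that the product of the per-position radices fits in the platform integer type (NumPy raises an error otherwise). The variable names are illustrative.

```python
import numpy as np

# Two distinct samples, shape (2, 2, 1).
a = np.array([[[1], [2]],
              [[2], [1]]])

# Weighted sum from the einsum approach: weights are max + 1 per position.
w = a.max(0) + 1                      # [[3], [3]]
a1D = np.einsum('ijk,jk->i', a, w)    # 1*3 + 2*3 and 2*3 + 1*3
print(a1D)

# Mixed-radix encoding: each position acts as one digit with its own base.
flat = a.reshape(len(a), -1)
radix = flat.max(0) + 1
codes = np.ravel_multi_index(tuple(flat.T), tuple(radix))
print(codes)
```

The einsum values coincide for the two samples, whereas the mixed-radix codes stay distinct.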