
Numpy way to check if any two samples in a tensor are identical

Some examples:

import numpy as np
tensor_same = np.array([[1]*10 + [2] * 10 + [1] * 10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2] * 10 + [1] * 9 + [2]]).reshape((-1, 10, 1))

The first tensor has two samples that are the same. In the second, all samples are different.

What's the fastest way of checking this for very large tensors?

Solution

We can use np.unique along the first axis to get the unique samples. If their number equals the number of samples in the original input, all samples are different; otherwise there is at least one duplicate, like so -

In [25]: len(np.unique(tensor_same,axis=0)) != len(tensor_same)
Out[25]: True

In [26]: len(np.unique(tensor_diff,axis=0)) != len(tensor_diff)
Out[26]: False
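
The same check can be wrapped in a small helper. This is only a sketch: the name has_duplicate_samples is made up for illustration, and it assumes a NumPy version whose np.unique accepts the axis argument (1.13 or later).

```python
import numpy as np

def has_duplicate_samples(t):
    """Return True if any two slices along axis 0 are identical."""
    # np.unique(..., axis=0) collapses identical samples into one,
    # so fewer unique samples than inputs means a duplicate exists.
    return len(np.unique(t, axis=0)) != len(t)

# The two tensors from the question
tensor_same = np.array([[1]*10 + [2]*10 + [1]*10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2]*10 + [1]*9 + [2]]).reshape((-1, 10, 1))

print(has_duplicate_samples(tensor_same))  # True
print(has_duplicate_samples(tensor_diff))  # False
```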

Another way would be to use the counts returned by np.unique: any count greater than 1 means that a sample occurs more than once -

In [42]: (np.unique(tensor_same,axis=0, return_counts=1)[1]>1).any()
Out[42]: True

In [43]: (np.unique(tensor_diff,axis=0, return_counts=1)[1]>1).any()
Out[43]: False
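
If you also want to know which samples are repeated, the same counts can be used as a mask on the unique samples. A minimal sketch, reusing the tensors from the question (the helper name duplicated_samples is hypothetical):

```python
import numpy as np

def duplicated_samples(t):
    """Return the distinct samples that occur more than once along axis 0."""
    uniq, counts = np.unique(t, axis=0, return_counts=True)
    # Keep only those unique samples whose count exceeds 1
    return uniq[counts > 1]

tensor_same = np.array([[1]*10 + [2]*10 + [1]*10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2]*10 + [1]*9 + [2]]).reshape((-1, 10, 1))

dup_same = duplicated_samples(tensor_same)
dup_diff = duplicated_samples(tensor_diff)

print(dup_same.shape)  # (1, 10, 1): the all-ones sample appears twice
print(dup_diff.shape)  # (0, 10, 1): nothing is repeated
```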

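Another way would be to sort the samples so that identical ones end up next to each other and then compare consecutive samples. Sorting with np.sort(..., axis=0) would sort every position independently and mix values from different samples, so we flatten each sample to a row and order whole rows with np.lexsort, then look for consecutive rows whose differences are all zero -

In [29]: s = tensor_same.reshape(len(tensor_same), -1)
    ...: (np.diff(s[np.lexsort(s.T)], axis=0) == 0).all(1).any()
Out[29]: True

In [30]: s = tensor_diff.reshape(len(tensor_diff), -1)
    ...: (np.diff(s[np.lexsort(s.T)], axis=0) == 0).all(1).any()
Out[30]: False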
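
To see why whole samples must stay together while sorting, here is a small sketch with a made-up input of three distinct samples. It compares a per-position sort (np.sort along axis 0) against a row-wise ordering via np.lexsort; the array and variable names are illustrative only.

```python
import numpy as np

# Hypothetical input: three distinct samples, each of shape (2, 1)
a = np.array([[1, 2], [1, 3], [2, 2]]).reshape(3, 2, 1)

# Per-position sort: every position is sorted on its own,
# which mixes values that belong to different samples.
col_sorted = np.sort(a, axis=0)
naive = bool((np.diff(col_sorted, axis=0) == 0).all(1).any())

# Row-wise ordering: flatten each sample and reorder whole rows,
# so identical samples (if any) end up adjacent.
rows = a.reshape(len(a), -1)
rows_sorted = rows[np.lexsort(rows.T)]
exact = bool((rows_sorted[1:] == rows_sorted[:-1]).all(1).any())

print(naive)  # True  (false positive: no two samples are identical)
print(exact)  # False
```

np.lexsort performs one stable sort per flattened position, so for samples with many elements the view-based or np.unique approaches are likely to be faster.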

Another way would be to use views so that each 2D block is treated as a single element, and then apply the same idea of sorting and looking for identical consecutive elements, like so -

# https://stackoverflow.com/a/44999009/ @Divakar
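def view1D(a): # a is a 2D array
    # View each row as one opaque (void) element covering the row's bytes
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def is_any_identical(a):
    # Flatten each sample to a row, then collapse each row to one element
    a1D = view1D(a.reshape(a.shape[0],-1))
    # After sorting, identical samples sit next to each other
    a1Ds = np.sort(a1D)
    return (a1Ds[:-1] == a1Ds[1:]).any()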

Sample run -

In [90]: np.random.seed(0)
    ...: a = np.random.randint(11,99,(6,4,3))

In [91]: is_any_identical(a)
Out[91]: False

In [92]: a[2] = a[1] # force one identical element

In [93]: is_any_identical(a)
Out[93]: True
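
Since the question asks about speed, a rough way to compare the view-based function against the np.unique check is to time both on the same data. The sketch below is an assumption-laden setup: the array shape, the number of repetitions and the helper name is_any_identical_unique are arbitrary choices, and the timings will depend on your machine, dtype and NumPy version.

```python
import timeit
import numpy as np

def view1D(a):
    # View each row of a 2D array as one void element
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def is_any_identical(a):
    # Sort the per-sample views and compare neighbours
    a1Ds = np.sort(view1D(a.reshape(a.shape[0], -1)))
    return bool((a1Ds[:-1] == a1Ds[1:]).any())

def is_any_identical_unique(a):
    # np.unique-based check for comparison
    return len(np.unique(a, axis=0)) != len(a)

rng = np.random.default_rng(0)
big = rng.integers(11, 99, size=(5000, 20, 10))  # arbitrary test shape
big_dup = big.copy()
big_dup[-1] = big_dup[0]  # force one duplicated sample

for name, fn in [("views + sort", is_any_identical),
                 ("np.unique", is_any_identical_unique)]:
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```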

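For positive ints, alternatively we can use np.einsum to get a similar dimensionality-reduction and end up with one number per 2D block. This weighted sum is not guaranteed to be unique per block (different blocks can produce the same number, and large values can overflow), so it may report a duplicate that is not there. With that caveat, the a1D equivalent in is_any_identical() would be -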

a1D = np.einsum('ijk,jk->i',a,a.max(0)+1)
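
To make the collision risk of this weighted sum concrete, here is a tiny sketch with a made-up array of two different samples that reduce to the same number. Equal values from the einsum reduction are therefore best treated as candidates to be confirmed with an exact comparison.

```python
import numpy as np

# Hypothetical input: two different samples, each of shape (1, 2)
a = np.array([[[1, 2]],
              [[2, 1]]])

weights = a.max(0) + 1                        # [[3, 3]]
a1D = np.einsum('ijk,jk->i', a, weights)      # one number per sample

print(a1D)                                    # [9 9]
print(np.array_equal(a[0], a[1]))             # False: the samples differ
```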