Some examples:
import numpy as np
tensor_same = np.array([[1]*10 + [2] * 10 + [1] * 10]).reshape((-1, 10, 1))
tensor_diff = np.array([[1]*10 + [2] * 10 + [1] * 9 + [2]]).reshape((-1, 10, 1))
The first tensor has two samples that are the same. In the second, all samples are different.
What's the fastest way of checking this for very large tensors?
We can use np.unique along the first axis to count the unique blocks. If that count equals the number of samples in the input, all samples are different; otherwise there is at least one duplicate, like so -
In [25]: len(np.unique(tensor_same,axis=0)) != len(tensor_same)
Out[25]: True
In [26]: len(np.unique(tensor_diff,axis=0)) != len(tensor_diff)
Out[26]: False
Another way would be to use the counts returned by np.unique -
In [42]: (np.unique(tensor_same,axis=0, return_counts=1)[1]>1).any()
Out[42]: True
In [43]: (np.unique(tensor_diff,axis=0, return_counts=1)[1]>1).any()
Out[43]: False
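Another way would be to sort along the first axis, take the differences between consecutive elements, look for all zeros along the second axis and finally check for ANY match. Note that np.sort with axis=0 sorts every position independently instead of sorting whole samples, so this check can report duplicates that are not there; treat it as a quick heuristic -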
In [29]: (np.diff(np.sort(tensor_same,axis=0),axis=0)==0).all(1).any()
Out[29]: True
In [30]: (np.diff(np.sort(tensor_diff,axis=0),axis=0)==0).all(1).any()
Out[30]: False
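A small counterexample shows where the column-wise sort goes wrong: three distinct samples can line up into identical rows once every position is sorted on its own. The sketch below reproduces that and compares it with a lexicographic sort of whole samples via np.lexsort. The function names and the `tricky` array are made up for illustration, and the lexsort variant is a swapped-in alternative, not part of the original answer.

```python
import numpy as np

def has_duplicates_colsort(a):
    # Same idea as the one-liner above, applied to flattened samples:
    # every column is sorted independently of the others.
    a2d = a.reshape(a.shape[0], -1)
    return bool((np.diff(np.sort(a2d, axis=0), axis=0) == 0).all(1).any())

def has_duplicates_lexsort(a):
    # Alternative sketch: sort whole samples lexicographically, so that
    # identical samples end up adjacent, then compare neighbours.
    a2d = a.reshape(a.shape[0], -1)
    # np.lexsort treats the LAST key as the primary one, hence the reversal.
    order = np.lexsort(a2d.T[::-1])
    s = a2d[order]
    return bool((s[1:] == s[:-1]).all(axis=1).any())

# Three samples, all different from each other (hypothetical test data).
tricky = np.array([[1, 1], [1, 2], [2, 1]]).reshape(3, 2, 1)

print(has_duplicates_colsort(tricky))  # True  (false positive)
print(has_duplicates_lexsort(tricky))  # False
```

np.lexsort with many keys can be slow for wide samples, so this is meant as a correctness reference more than a speed recommendation.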
Another way would be to use views such that each 2D block is seen as a single element, and then employ the same idea of sorting and looking for identical consecutive elements, like so -
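# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is a 2D array
    a = np.ascontiguousarray(a)
    # View each row as one opaque (void) element spanning the whole row's bytes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def is_any_identical(a):
    # Flatten every sample to one row, then collapse each row to one element
    a1D = view1D(a.reshape(a.shape[0],-1))
    # After sorting, identical samples sit next to each other
    a1Ds = np.sort(a1D)
    return (a1Ds[:-1] == a1Ds[1:]).any()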
Sample run -
In [90]: np.random.seed(0)
...: a = np.random.randint(11,99,(6,4,3))
In [91]: is_any_identical(a)
Out[91]: False
In [92]: a[2] = a[1] # force one identical element
In [93]: is_any_identical(a)
Out[93]: True
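Since the question is about speed, here is a rough timing harness for comparing the np.unique check with the view-based one on your own data. The array size, the seed and the `unique_based` wrapper name are arbitrary choices for illustration; timings depend on the machine, dtype and shape, so no numbers are claimed here.

```python
import timeit
import numpy as np

def view1D(a):
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def is_any_identical(a):
    a1D = view1D(a.reshape(a.shape[0], -1))
    a1Ds = np.sort(a1D)
    return (a1Ds[:-1] == a1Ds[1:]).any()

def unique_based(a):
    # The first approach from above, wrapped in a function for timing.
    return len(np.unique(a, axis=0)) != len(a)

# Hypothetical test data: 20000 samples of shape (10, 1).
rng = np.random.default_rng(0)
big = rng.integers(0, 100, size=(20000, 10, 1))

# A second array with one forced duplicate, as a sanity check.
big_dup = big.copy()
big_dup[-1] = big_dup[0]

for name, fn in [("np.unique", unique_based), ("void view", is_any_identical)]:
    # Best of three single runs, to reduce noise a little.
    t = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{name}: {t:.4f} s, duplicate found in big_dup: {bool(fn(big_dup))}")
```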
For positive ints, alternatively we can use np.einsum to get the same dimensionality reduction and end up with one element for each 2D block. This weighted sum is not guaranteed to be unique per block, so it can flag distinct blocks as identical. With that caveat, the a1D equivalent in is_any_identical() would be -
a1D = np.einsum('ijk,jk->i',a,a.max(0)+1)
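One thing to watch with the einsum reduction: different blocks can collapse to the same number. The sketch below shows a two-block collision and, as a swapped-in alternative that is not from the original answer, a mixed-radix encoding via np.ravel_multi_index that gives each distinct block its own integer. It assumes non-negative ints and that the product of the per-position ranges fits in the index type (NumPy raises a ValueError otherwise); `block_ids` is a made-up name.

```python
import numpy as np

# Two different blocks of shape (1, 2): [1, 2] and [2, 1].
a = np.array([[[1, 2]], [[2, 1]]])

# The einsum reduction from above gives both blocks the same value.
a1D = np.einsum('ijk,jk->i', a, a.max(0) + 1)
print(a1D)  # [9 9]

def block_ids(a):
    # Treat each flattened block as the digits of a mixed-radix number,
    # with radix max+1 at every position, so distinct blocks get distinct ids.
    a2d = a.reshape(a.shape[0], -1)
    dims = tuple(int(d) for d in a2d.max(0) + 1)
    return np.ravel_multi_index(tuple(a2d.T), dims)

ids = block_ids(a)
print(ids)  # [5 7]

# Same neighbour comparison as in is_any_identical(), on the ids.
ids_sorted = np.sort(ids)
print(bool((ids_sorted[:-1] == ids_sorted[1:]).any()))  # False
```

For long blocks the radix product overflows quickly, so this only suits small blocks with a small value range; the void-view approach has no such limit.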