python, numpy, multidimensional-array, set-difference

Python NumPy: get difference between two 2-dimensional arrays


Well, I have a simple problem that is giving me a headache. Basically, I have two two-dimensional arrays full of [x, y] coordinates, and I want to compare the first with the second and generate a third array containing all the elements of the first array that don't appear in the second. It sounds simple, but I couldn't make it work at all. The sizes vary a lot: the first array can have between a thousand and 2 million coordinates, while the second array has between 1 and a thousand.

This operation will occur many times, and the larger the first array is, the more times it will occur.

sample:

arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])

arr2 = np.array([[0, 3], [1, 7]])

result = np.array([[0, 4], [1, 3]])

In depth: Basically, I have a binary image with variable resolution. It is composed of 0s and 1s (stored as 255), and I analyze each pixel individually (with an algorithm that is already optimized). On purpose, every time this function is executed it analyzes only a fraction of the pixels, and when it finishes it gives me back the coordinates of those pixels. The problem is that every time it executes, it runs the following code:

ones = np.argwhere(img == 255)  # coordinates of all white (255) pixels

It takes about 0.02 seconds and is by far the slowest part of the code. My idea is to create this variable once and, each time the function ends, remove the already-analyzed pixels from it and pass the new, smaller array as a parameter to the next call, continuing until the array is empty.


Solution

  • Not sure what you intend to do with the extra dimensions, as the set difference, like any filtering, inherently loses the shape information.

    Anyway, NumPy does provide np.setdiff1d() to solve this problem elegantly.
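
    For reference, a minimal sketch of what np.setdiff1d() does on plain 1D data (note that it flattens non-1D inputs, so on its own it does not treat [x, y] rows as single elements):

    import numpy as np

    a = np.array([1, 2, 3, 4])
    b = np.array([2, 4])
    print(np.setdiff1d(a, b))
    # [1 3]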


    EDIT: With the clarifications provided, you seem to be looking for a way to compute the set difference along a given axis, i.e. where the elements of the sets are themselves arrays.

    There is no built-in specifically for this in NumPy, but it is not too difficult to craft one. For simplicity, we assume that the operating axis is the first one (so that the elements of the set are the rows arr[i]), that only unique elements appear in the first array, and that the arrays are 2D.

    The approaches below are all based on the idea that the asymptotically best strategy is to build a set() from the second array and then use it to filter out the entries of the first array.

    The idiomatic way to build such a set in Python / NumPy is:

    set(map(tuple, arr))
    

    where the mapping to tuple freezes each arr[i], making it hashable and hence usable as a set() member.
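
    For instance, a quick check of the idea (a minimal sketch, not part of the original answer):

    import numpy as np

    arr = np.array([[0, 3], [1, 7]])
    s = set(map(tuple, arr))
    print((0, 3) in s)         # True -- each row is now a hashable tuple
    print(tuple(arr[0]) in s)  # True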

    Unfortunately, since the filtering would produce results of unpredictable size, NumPy arrays are not the ideal container for the result.

    To solve this issue, one can use:

    1. An intermediate list:
    import numpy as np
    
    
    def setdiff2d_list(arr1, arr2):
        delta = set(map(tuple, arr2))
        return np.array([x for x in arr1 if tuple(x) not in delta])
    
    2. np.fromiter() followed by np.reshape():
    import numpy as np
    
    
    def setdiff2d_iter(arr1, arr2):
        delta = set(map(tuple, arr2))
        flat = (x for xs in arr1 if tuple(xs) not in delta for x in xs)
        return np.fromiter(flat, dtype=arr1.dtype).reshape(-1, arr1.shape[-1])
    
    3. NumPy's advanced indexing:
    def setdiff2d_idx(arr1, arr2):
        delta = set(map(tuple, arr2))
        idx = [tuple(x) not in delta for x in arr1]
        return arr1[idx]
    
    4. Convert both inputs to set() (will force uniqueness of the output elements and will lose ordering):
    import numpy as np
    
    
    def setdiff2d_set(arr1, arr2):
        set1 = set(map(tuple, arr1))
        set2 = set(map(tuple, arr2))
        return np.array(list(set1 - set2))
    

    Alternatively, the advanced indexing can be built using broadcasting, np.any() and np.all():

    def setdiff2d_bc(arr1, arr2):
        idx = (arr1[:, None] != arr2).any(-1).all(1)
        return arr1[idx]
    

    Some of the above methods were originally suggested in @QuangHoang's answer.

    A similar approach could also be implemented in Numba, following the same idea as above but using a hash instead of the actual array view arr[i] (because of the limitations in what is supported inside a set() by Numba) and pre-computing the output size (for speed):

    import numpy as np
    import numba as nb
    
    
    @nb.njit
    def mul_xor_hash(arr, init=65537, k=37):
        # NOTE: the uint64 view below assumes 64-bit elements (e.g. the default
        # int64 coordinates produced by np.argwhere() on most platforms).
        result = init
        for x in arr.view(np.uint64):
            result = (result * k) ^ x
        return result
    
    
    @nb.njit
    def setdiff2d_nb(arr1, arr2):
        # : build `delta` set using hashes
        delta = {mul_xor_hash(arr2[0])}
        for i in range(1, arr2.shape[0]):
            delta.add(mul_xor_hash(arr2[i]))
        # : compute the size of the result
        n = 0
        for i in range(arr1.shape[0]):
            if mul_xor_hash(arr1[i]) not in delta:
                n += 1
        # : build the result
        result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
        j = 0
        for i in range(arr1.shape[0]):
            if mul_xor_hash(arr1[i]) not in delta:
                result[j] = arr1[i]
                j += 1
        return result
    
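    (Note that, being decorated with @nb.njit, setdiff2d_nb() is compiled lazily on its first call for the given argument types, so that first invocation also pays the JIT compilation cost; the timings below reflect already-compiled calls.)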

    While they all give the same result:

    funcs = setdiff2d_iter, setdiff2d_list, setdiff2d_idx, setdiff2d_set, setdiff2d_bc, setdiff2d_nb
    
    arr1 = np.array([[0, 3], [0, 4], [1, 3], [1, 7]])
    print(arr1)
    # [[0 3]
    #  [0 4]
    #  [1 3]
    #  [1 7]]
    
    arr2 = np.array([[0, 3], [1, 7], [4, 0]])
    print(arr2)
    # [[0 3]
    #  [1 7]
    #  [4 0]]
    
    result = funcs[0](arr1, arr2)
    print(result)
    # [[0 4]
    #  [1 3]]
    
    for func in funcs:
        print(f'{func.__name__:>24s}', np.all(result == func(arr1, arr2)))
    #           setdiff2d_iter True
    #           setdiff2d_list True
    #            setdiff2d_idx True
    #            setdiff2d_set False  # because of ordering
    #             setdiff2d_bc True
    #             setdiff2d_nb True
    

    their performance varies:

    for func in funcs:
        print(f'{func.__name__:>24s}', end='  ')
        %timeit func(arr1, arr2)
    #           setdiff2d_iter  16.3 µs ± 719 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    #           setdiff2d_list  14.9 µs ± 528 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    #            setdiff2d_idx  17.8 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    #            setdiff2d_set  17.5 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    #             setdiff2d_bc  9.45 µs ± 405 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    #             setdiff2d_nb  1.58 µs ± 51.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    The proposed Numba-based approach seems to outperform the other approaches by a fair margin (roughly 10x on the given input).

    Similar timings are observed with larger inputs:

    np.random.seed(42)
    
    arr1 = np.random.randint(0, 100, (1000, 2))
    arr2 = np.random.randint(0, 100, (1000, 2))
    print(setdiff2d_nb(arr1, arr2).shape)
    # (736, 2)
    
    
    for func in funcs:
        print(f'{func.__name__:>24s}', end='  ')
        %timeit func(arr1, arr2)
    #           setdiff2d_iter  3.51 ms ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #           setdiff2d_list  2.92 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #            setdiff2d_idx  2.61 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #            setdiff2d_set  3.52 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    #             setdiff2d_bc  25.6 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    #             setdiff2d_nb  192 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    (As a side note, setdiff2d_bc() is the most negatively affected by the size of the second input, since the broadcasted comparison creates an intermediate boolean array of shape (len(arr1), len(arr2), 2).)
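
    As a hypothetical sketch of how this could plug into the pixel workflow from the question (the random image and analyze_chunk() below are stand-ins for the real data and analysis, not part of the original code; setdiff2d_nb() is the function defined above), the coordinate array would be computed once with np.argwhere() and then shrunk after every pass:

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.choice([0, 255], size=(100, 100))  # stand-in binary image
    ones = np.argwhere(img == 255)               # computed only once

    def analyze_chunk(coords, k=500):
        # Stand-in for the (already optimized) analysis: pretend it handled
        # the first k coordinates and return them as "processed".
        return coords[:k]

    while ones.shape[0] > 0:
        processed = analyze_chunk(ones)
        ones = setdiff2d_nb(ones, processed)     # drop the pixels just analyzed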