Numpy: Finding correspondences of a 2D array in another 2D array, without taking order into account


I have two numpy arrays with different shapes:

vertices = np.array([[ 100, 101, 102, 103,  -1],
                     [ 200, 201, 202, 203, 204],
                     [ 300, 301, 302, 303, 104],
                     [ 505, 506, 507,  -1,  -1]])

faces = np.array([[ 104, 102,  100],
                  [1202, 203, 2000],
                  [ 303, 505,  104],
                  [ 101, 102,  104]])

I want to link each row in faces to a row in vertices, in order to compute the corresponding area for each vertex. The actual area computation is omitted here because it is irrelevant to the question.

Each vertex can correspond to any number of faces. Each face can only be attached to one vertex.

Link rule: a row of faces is linked to a row of vertices if at least 2 of its elements appear in that vertices row.

The expected result is a dictionary whose keys are row indices into vertices and whose values are the number of linked faces for the corresponding vertex.

Expected result:

{0: 2.0, 2: 1.0}

I wrote a working algorithm, but I am looking for a more performant implementation:

from collections import defaultdict
import numpy as np

def area(faces, corresponding_vertices_id):
    skeleton_node_corresponding_area = defaultdict(lambda: 0.)
    for face in faces:
        for skeleton_node, vertex_ids in enumerate(corresponding_vertices_id):
            # Link the face to the first vertex row sharing at least 2 elements
            if np.count_nonzero(np.in1d(face, vertex_ids)) >= 2:
                area = 1  # Mock value; the real area will be computed later
                skeleton_node_corresponding_area[skeleton_node] += area
                break
    return skeleton_node_corresponding_area

area(faces, vertices)

defaultdict(<function __main__.area.<locals>.<lambda>()>, {0: 2.0, 2: 1.0})

Solution

  • in1d(a, b) is essentially doing this (but likely without the intermediate arrays, and with some short-circuiting):

    (a == b[:, None]).any(axis=0)
    

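    As a quick sanity check, the two forms produce the same boolean mask (a minimal sketch using the first row of faces and the first row of vertices from the question):

    import numpy as np

    a = np.array([104, 102, 100])
    b = np.array([100, 101, 102, 103, -1])

    # Compare every element of a against every element of b, then
    # collapse along b's axis: "is each element of a present in b?"
    mask_broadcast = (a == b[:, None]).any(axis=0)
    mask_in1d = np.in1d(a, b)

    print(mask_broadcast)  # [False  True  True]
    print(mask_in1d)       # [False  True  True]
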
    You can speed this up a bit by pre-sorting the arrays. This should be valid as long as you only care that two numbers match:

    vertices.sort(axis=1)
    

    Now you can do something like

    result = defaultdict(float)
    for face in faces:
        for i, vertex in enumerate(vertices):
            # Count how many elements of face occur in this (sorted) vertex row
            n = (vertex[np.searchsorted(vertex, face) % vertices.shape[1]] == face).sum()
            if n >= 2:
                result[i] += 1.
                break
    

    This works by doing a binary search of each element of faces in each row of vertices. Since searchsorted returns insertion indices, you have to check which locations actually match the value. The modulo operator ensures that elements with insertion index past the end of the array do not cause an IndexError. The proper way to handle that would be a conditional, but setting their indices to zero is faster, and works out OK since they don't match either way. The break speeds things up since you mention that each face can only belong to a single vertex.
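
    To make the modulo trick concrete, here is a minimal sketch with the first face and the first (pre-sorted) row of vertices:

    import numpy as np

    vertex = np.array([-1, 100, 101, 102, 103])  # vertices[0] after sorting
    face = np.array([104, 102, 100])

    idx = np.searchsorted(vertex, face)
    print(idx)                                    # [5 3 1] -- 104 lands past the end
    print(idx % vertex.shape[0])                  # [0 3 1] -- out-of-range index wraps to 0
    print(vertex[idx % vertex.shape[0]] == face)  # [False  True  True]

    The wrapped index points at vertex[0] (-1 here), which does not equal 104, so the non-match is still counted correctly.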

    This is still not terribly fast. Assuming a pair of arrays of shape (M, N), your algorithm is O(M^2 * N^2). This one performs an O(M * N * log(N)) sort, followed by a nested loop that is O(M * M * N * log(N)) (each search is O(log(N)), but it has to be done for each element of the row), for a total of O(M^2 * N * log(N)).

    A more performant approach (once you get past the point where the extra overhead matters) would be to use python sets, since membership testing for an individual element is O(1) on average:

    vertices = [set(vertex) for vertex in vertices]
    faces = [set(face) for face in faces]
    result = defaultdict(float)
    
    for face in faces:
        for i, vertex in enumerate(vertices):
            # Set intersection gives the overlap count in O(N) on average
            if len(face & vertex) >= 2:
                result[i] += 1
                break
    

    If the area depends on the original values of face and vertex before the conversion to sets, run your loops over indices only and access what you need by index, as in the sketch below.
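
    A minimal sketch of that index-based variant (vertex_sets and face_sets are hypothetical names; the original arrays are left untouched):

    vertex_sets = [set(vertex) for vertex in vertices]
    face_sets = [set(face) for face in faces]
    result = defaultdict(float)

    for j, face in enumerate(face_sets):
        for i, vertex in enumerate(vertex_sets):
            if len(face & vertex) >= 2:
                # faces[j] and vertices[i] are still the original rows,
                # available here for the real area computation
                result[i] += 1
                break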

    This is O(M * M * N) since face & vertex is an O(N) operation. If you had a non-random distribution of values, you could probably reduce the complexity of the outer loops, e.g., by sorting the rows.