Search code examples
pythonarraysnumpy

Best way to match between arrays of different length


I have two sorted numerical arrays of unequal length and I'm looking for a way to match between elements in both arrays such that there's a one-to-one match between (most) gt and pred elements. A match is only considered valid if it is below some threshold thrsh (so there could be unmatched gt elements), and we define a cost of a match as the difference between matched gt and pred elements.
From here we can define three types of matches:

  • First Match takes the first valid match for each gt element, as shown in this answer.
  • Greedy Match tries to match as many gt elements as possible, regardless of the "cost" of matching.
  • Min Cost Match is the match with minimal "cost".

For example:

thrsh=3
gt = [4,8,15,16,23,42,45]
pred = [4,5,7,16,19,44]
# some matching voodoo...
# first_match = [(4,4),(8,5),(15,16),(16,19),(42,44)]  # gt23, gt45 and pred5 are unmatched
# greedy_match is the same as first_match
# min_cost_match = [(4,4),(8,7),(16,16),(45,44)]  # gt15 and pred19 now unmatched as well

I believe First Match and Greedy Match are always the same (up to the last matched pair), and as mentioned, there's an implementation of First Match here. However, I can't find a way to implement the Min Cost Match, as any iteration may also change matches for previous iteration (depending on how large thrsh is, this could be very computationally heavy).


Solution

  • Seems like a DP-ish problem that can be solved if you make an MxN table, and do some LCS-like traversal. In numpy, you can do this:

    thrsh=3
    gt = np.array([4,8,15,16,23,42,45])
    pred = np.array([4,5,7,16,19,44])
    
    table = abs(gt[:, None] - pred[None, :])
    
    # array([[ 0,  1,  3, 12, 15, 40],
    #       [ 4,  3,  1,  8, 11, 36],
    #       [11, 10,  8,  1,  4, 29],
    #       [12, 11,  9,  0,  3, 28],
    #       [19, 18, 16,  7,  4, 21],
    #       [38, 37, 35, 26, 23,  2],
    #       [41, 40, 38, 29, 26,  1]])
    

    It is easy to observe the pattern here. Any item which is the minimum cost both row-wise and column-wise (and is also within the threshold), must be a part of the min cost match.

    rowwise = np.stack([table.argmin(0), np.arange(table.shape[1])]).T
    colwise = np.stack([np.arange(table.shape[0]), table.argmin(1)]).T
    rc = (rowwise[:, None] == colwise).all(-1).any(1)
    
    idxs = rowwise[rc]
    out = np.stack([gt[idxs[:, 0]], pred[idxs[:, 1]]]).T
    
    # array([[ 4,  4],
    #       [ 8,  7],
    #       [16, 16],
    #       [45, 44]])
    

    After you obtain out, you can then maybe filter it to see if any match is over the threshold.