Search code examples
algorithmoptimizationmatchingpyomopairing

Optimal row/column pairing algorithm


I encountered a problem while I was trying to match images with their correlation coefficient.

Say we have 5 thumbnails (a, b, c, d, e) and we need to find the best corresponding thumbnail for each one of them on another set of thumbnails (f, g, h, i, j). (One item cannot be re-used.)

For each possible pair, we compute the correlation coefficient (measurement of similarity).

     f     g     h     i     j
  |-----|-----|-----|-----|-----|
a | 0.5 | 0.7 |  0  |  0  |  0  |
  |-----|-----|-----|-----|-----|
b | 0.7 | 0.8 |  0  |  0  |  0  |
  |-----|-----|-----|-----|-----|
c |  0  |  0  |  0  |  0  | 0.8 |
  |-----|-----|-----|-----|-----|
d |  0  |  0  | 0.5 | 0.6 | 0.7 |
  |-----|-----|-----|-----|-----|
e |  0  | 0.6 | 0.7 | 0.5 |  0  |
  |-----|-----|-----|-----|-----|

What I do :

  • Find the max for each raw

         f     g     h     i     j
      |-----|-----|-----|-----|-----|
    a |  0  | 0.7 |  0  |  0  |  0  |
      |-----|-----|-----|-----|-----|
    b |  0  | 0.8 |  0  |  0  |  0  |
      |-----|-----|-----|-----|-----|
    c |  0  |  0  |  0  |  0  | 0.8 |
      |-----|-----|-----|-----|-----|
    d |  0  |  0  |  0  |  0  | 0.7 |
      |-----|-----|-----|-----|-----|
    e |  0  |  0  | 0.7 |  0  |  0  |
      |-----|-----|-----|-----|-----|
    
  • Find the max for each column

          f     g     h     i     j
       |-----|-----|-----|-----|-----|
     a |  0  |  0  |  0  |  0  |  0  |
       |-----|-----|-----|-----|-----|
     b |  0  | 0.8 |  0  |  0  |  0  |
       |-----|-----|-----|-----|-----|
     c |  0  |  0  |  0  |  0  | 0.8 |
       |-----|-----|-----|-----|-----|
     d |  0  |  0  |  0  |  0  |  0  |
       |-----|-----|-----|-----|-----|
     e |  0  |  0  | 0.7 |  0  |  0  |
       |-----|-----|-----|-----|-----|
    
  • Save those pairs in a table

  • Create a mask where the raw and the column of each number in this last table are equal to zero

          f     g     h     i     j
       |-----|-----|-----|-----|-----|
     a |  1  |  0  |  0  |  1  |  0  |
       |-----|-----|-----|-----|-----|
     b |  0  |  0  |  0  |  0  |  0  |
       |-----|-----|-----|-----|-----|
     c |  0  |  0  |  0  |  0  |  0  |
       |-----|-----|-----|-----|-----|
     d |  1  |  0  |  0  |  1  |  0  |
       |-----|-----|-----|-----|-----|
     e |  0  |  0  |  0  |  0  |  0  |
       |-----|-----|-----|-----|-----|
    
  • Multiply the mask with the first table

         f     g     h     i     j
      |-----|-----|-----|-----|-----|
    a | 0.5 |  0  |  0  |  0  |  0  |
      |-----|-----|-----|-----|-----|
    b |  0  |  0  |  0  |  0  |  0  |
      |-----|-----|-----|-----|-----|
    c |  0  |  0  |  0  |  0  |  0  |
      |-----|-----|-----|-----|-----|
    d |  0  |  0  |  0  | 0.6 |  0  |
      |-----|-----|-----|-----|-----|
    e |  0  |  0  |  0  |  0  |  0  |
      |-----|-----|-----|-----|-----|
    
  • Repeat the process until the matrix obtained at the second step is equal to zero

So at the end, the matching table looks like that :

        f     g     h     i     j
     |-----|-----|-----|-----|-----|
   a |  1  |  0  |  0  |  0  |  0  |
     |-----|-----|-----|-----|-----|
   b |  0  |  1  |  0  |  0  |  0  |
     |-----|-----|-----|-----|-----|
   c |  0  |  0  |  0  |  0  |  1  |
     |-----|-----|-----|-----|-----|
   d |  0  |  0  |  0  |  1  |  0  |
     |-----|-----|-----|-----|-----|
   e |  0  |  0  |  1  |  0  |  0  |
     |-----|-----|-----|-----|-----|

According the this method, the best pairs possible are : (a,f), (b,g), (c,j), (d,i) and (e,h)

Now the question is : Is there a better method?

Like for (a,b) and (f,g), wouldn't be better to add up their scores to find the best match?

Ex :

  (a,f) (b,g)
   0.5 + 0.7 = 1.2

  (a,g) (b,f)
   0.7 + 0.7 = 1.4

   1.4 > 1.2 => best pairs are (a,g) and (b,f)
   (As opposed to (a,f), (b,g) with the first method.)

If so, how to make it generalizable?

I hope that I've been clear enough to make you understand the problem.

Thanks in advance for your help.

EDIT :

I found out that the Hungarian algorithm is much faster than the ILP solution provided by AirSquid.

I compared the Hungarian implementation of Scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html) with the ILP based solution.

After 1000 one-to-one matching iterations of a random 20x20 matrix I got :

Method ite/s
ILP solution 4.06e-2
Hungarian algorithm 1.808e-5

From tests, I haven't seen any differences between those two methods.


Solution

  • This is a trivial pairing model for most any math solver and can be formulated as an ILP. If you wish to go this route in python, you have several choices (after learning a bit about LP/ILP formulation :) ). I'm partial to pyomo but pulp and or-tools are also viable. You will also need a solver engine. There are several freebies out there, some are easier to install than others. I believe pulp has a built-in solver, which is nice.

    There is probably a dynamic programming solution to consider as well, but this is fast & easy. For the examples I note in the problem below (a replica of the counter-example above and a random 20x20 matrix), optimal solutions are almost instantaneous.

    # pairing
    
    import pyomo.environ as pyo
    import numpy as np
    
    data = [[.99, .98, .97, .96, .95],
            [.98, .97, .96, .95, 0],
            [.97, .96, .95, 0,   0],
            [.96, .95, 0,   0,   0],
            [.95, 0,   0,   0,   0]]
    
    #data = np.random.rand(20, 20)  #alternate random data for testing...
    
    model = pyo.ConcreteModel('r-c pairings')
    
    #re-label the data and push into a dictionary
    labels = list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
    data = {(labels[r], labels[len(data) + c]) : data[r][c] 
                for r in range(len(data)) for c in range(len(data[0]))}
    
    # pyomo components
    model.R = pyo.Set(initialize = [k[0] for k in data.keys()])
    model.C = pyo.Set(initialize = [k[1] for k in data.keys()])
    model.corr = pyo.Param(model.R, model.C, initialize=data)
    model.X = pyo.Var(model.R, model.C, within=pyo.Binary)  # select pairing (r, c)
    
    
    # objective:  maximize overall value
    model.obj = pyo.Objective(expr=pyo.summation(model.corr, model.X), sense=pyo.maximize)  #shortcut to ∑cX
    
    # constraint:  only use each column value once
    def single_use(m, c):
        return sum(model.X[r,c] for r in model.R) <= 1
    model.C1 = pyo.Constraint(model.C, rule=single_use)
    
    # constraint:  only use each row value once
    def single_use_row(m, r):
        return sum(model.X[r,c] for c in model.C) <= 1
    model.C2 = pyo.Constraint(model.R, rule=single_use_row)
    
    # solve it...
    solver = pyo.SolverFactory('glpk')  # <-- need to have this solver installed
    result = solver.solve(model)
    print(result)
    pyo.display(model)
    

    Output (from the smaller data run):

    Problem: 
    - Name: unknown
      Lower bound: 4.75
      Upper bound: 4.75
      Number of objectives: 1
      Number of constraints: 11
      Number of variables: 26
      Number of nonzeros: 51
      Sense: maximize
    Solver: 
    - Status: ok
      Termination condition: optimal
      Statistics: 
        Branch and bound: 
          Number of bounded subproblems: 1
          Number of created subproblems: 1
      Error rc: 0
      Time: 0.010313272476196289
    Solution: 
    - number of solutions: 0
      number of solutions displayed: 0
    
    Model r-c pairings
    
      Variables:
        X : Size=25, Index=X_index
            Key        : Lower : Value : Upper : Fixed : Stale : Domain
            ('a', 'f') :     0 :   0.0 :     1 : False : False : Binary
            ('a', 'g') :     0 :   0.0 :     1 : False : False : Binary
            ('a', 'h') :     0 :   0.0 :     1 : False : False : Binary
            ('a', 'i') :     0 :   0.0 :     1 : False : False : Binary
            ('a', 'j') :     0 :   1.0 :     1 : False : False : Binary
            ('b', 'f') :     0 :   0.0 :     1 : False : False : Binary
            ('b', 'g') :     0 :   0.0 :     1 : False : False : Binary
            ('b', 'h') :     0 :   0.0 :     1 : False : False : Binary
            ('b', 'i') :     0 :   1.0 :     1 : False : False : Binary
            ('b', 'j') :     0 :   0.0 :     1 : False : False : Binary
            ('c', 'f') :     0 :   0.0 :     1 : False : False : Binary
            ('c', 'g') :     0 :   0.0 :     1 : False : False : Binary
            ('c', 'h') :     0 :   1.0 :     1 : False : False : Binary
            ('c', 'i') :     0 :   0.0 :     1 : False : False : Binary
            ('c', 'j') :     0 :   0.0 :     1 : False : False : Binary
            ('d', 'f') :     0 :   0.0 :     1 : False : False : Binary
            ('d', 'g') :     0 :   1.0 :     1 : False : False : Binary
            ('d', 'h') :     0 :   0.0 :     1 : False : False : Binary
            ('d', 'i') :     0 :   0.0 :     1 : False : False : Binary
            ('d', 'j') :     0 :   0.0 :     1 : False : False : Binary
            ('e', 'f') :     0 :   1.0 :     1 : False : False : Binary
            ('e', 'g') :     0 :   0.0 :     1 : False : False : Binary
            ('e', 'h') :     0 :   0.0 :     1 : False : False : Binary
            ('e', 'i') :     0 :   0.0 :     1 : False : False : Binary
            ('e', 'j') :     0 :   0.0 :     1 : False : False : Binary
    
      Objectives:
        obj : Size=1, Index=None, Active=True
            Key  : Active : Value
            None :   True :  4.75
    
      Constraints:
        C1 : Size=5
            Key : Lower : Body : Upper
              f :  None :  1.0 :   1.0
              g :  None :  1.0 :   1.0
              h :  None :  1.0 :   1.0
              i :  None :  1.0 :   1.0
              j :  None :  1.0 :   1.0
        C2 : Size=5
            Key : Lower : Body : Upper
              a :  None :  1.0 :   1.0
              b :  None :  1.0 :   1.0
              c :  None :  1.0 :   1.0
              d :  None :  1.0 :   1.0
              e :  None :  1.0 :   1.0