Search code examples
runique-constraintpairingmaximization

Unique element pairing between two vectors maximizing the overall sum


I have a data frame that contains all the possible combinations between the elements of two vectors and for each combination I have a corresponding score. I was trying to find an efficient way to find the subset of unique pairs with unique elements (i.e., an element from one vector can be found only once across all pairs) that maximizes the sum of scores corresponding to each combination. As example data, consider this df:

df = data.frame(Var1 = c("A", "B", "C"), Var2 = c("A", "C", "D"))
df = expand.grid(df$Var1, df$Var2)
df$score = c(1, 0.5, 2, 1, 0.5, 0.5, 1, 2, 1)
> df
  Var1 Var2 score
1    A    A   1.0
2    B    A   0.5
3    C    A   2.0
4    A    C   1.0
5    B    C   0.5
6    C    C   0.5
7    A    D   1.0
8    B    D   2.0
9    C    D   1.0
> 

The expected result would be:

A  C  1
B  D  2
C  A  2

Note that you can have overlap between the elements of the two vectors, but again each element from each vector should appear only once. Also, the pair A A 1 is allowed and would've been possible, but that would make it impossible then to generate the pair C A 2 which would increase the overall sum of the score. As an attempt I have used this one liner with the functionality from dplyr

df <- df %>% group_by(Var1) %>% slice(which.max(score)) %>% as.data.frame()

which produces:

> df
  Var1 Var2 score
1    A    A     1
2    B    D     2
3    C    A     2

which is close enough.. but the A from the second vector is repeated. Do you have any suggestions? Thank you in advance!


Solution

  • Well, I eventually found the solution based on the Hungarian algorithm implemented in the solve_LSAP function of the clue R package. To have it working, transform your df in a matrix like so:

    df = matrix(sapply(df$score, function(x) x), nrow=length(unique(df$Var1)), ncol=length(unique(df$Var2)), dimnames = list(unique(df$Var1), unique(df$Var2)))
    

    and apply the function

    df.res = solve_LSAP(df, maximum = T)
    > df.res
    Optimal assignment:
    1 => 2, 2 => 3, 3 => 1
    

    and then get back the actual nodes or names

    df.res = cbind(rownames(df), colnames(df)[df.res])
    > df.res
         [,1] [,2]
    [1,] "A"  "C" 
    [2,] "B"  "D" 
    [3,] "C"  "A" 
    > 
    

    Tadaaaaam!