Search code examples
ralgorithmperformanceigraphcombinatorics

How do I select a combination of two variables from a dataframe when each variable value can only be selected once?


I have a data frame similar to

 TheDF <- data.frame(VarA=rep(c(7, 8, 11, 14), 4), Var2=c(1,1,1,0, 0,2,2,0, 0,0,3,0, 0,0,4,4), 
                Var3=c(50, 50, 50, 50, 100, 100, 100, 100, 150, 150, 150, 150, 200, 200, 200, 200))
 TheDF <- TheDF %>%
    filter(Var2>0)

I need a solution that selects the combination of rows such that each value of Var1 and Var2 can only be selected once.

The result would look like (in this case there is only one solution)

 Var1      Var2
 7          1
 8          2
 11         3
 14         4

This is for any number of Var1 and Var2. If there is a non-unique solution such as:

 TheDF2 <- data.frame(VarA=rep(c(7, 8, 11, 14), 5), Var2=c(1,1,1,0, 0,2,2,0, 0,0,3,0, 0,0,4,4, 0,0,5,5), 
                Var3=c(50, 50, 50, 50, 100, 100, 100, 100, 150, 150, 150, 150, 200, 200, 200, 200, 250, 250, 250, 250))
TheDF2 <- TheDF2 %>%
   filter(Var2>0)

where Var2 values of both 4 and 5 can satisfy the requirement for uniqueness, the selected result should be based on Var3. The selected Var2 value would be 5, because the Var3 value for 5 is 250, compared to 200 for Var4, giving:

 Var1   Var2
 7      1
 8      2
 11     3
 14     5

If both Var3 values were the same, it would be a random choice.

If it was impossible to get a unique solution for one pair, then just return the ones that were unique. E.g.

 TheDF3 <- data.frame(VarA=rep(c(7, 8, 11, 14), 4), Var2=c(1,0,0,0, 0,2,2,0, 3,0,0,0, 0,0,0,4), 
                 Var3=c(50, 50, 50, 50, 100, 100, 100, 100, 150, 150, 150, 150, 200, 200, 200, 200))
TheDF3 <- TheDF3 %>%
  filter(Var2>0)

One solution would be, using the Var3 weights:

 Var1    Var2
 7       3
 8       2
 14      4     

The other possible solution is:

 Var1     Var2
 7        3
 11       2
 14       4

because the sum of the weights for both solutions is the same.

Edit: I am now leaning towards a solution similar to combination method to get every combination of Var1 and Var2, with the value of Var1 as the column headings and the value of Var2 as the cell values. With a new variable as the sum of the Var3 in the row, and another new variable that counts the number of duplicates in the row. I don't know how to implement this either.

In the current example I am working on IRL, the number of combinations is 15,972.

Edit 2: desired combination output would look like (using DF1):

        7    8    11   14     Duplicates   Weight
 [1]    1    1    1    4      2            350
 [2]    1    1    2    4      1            400
 [3]    1    1    3    4      1            450
 [4]    1    1    4    4      2            500
 [5]    1    2    1    4      1            400
 ...
 [n]    1    2    4    4      1            550

with VarA turning into the column headers, and each row is each combination of Var2 values. Duplicates is the number of duplicated Var2 values in the row, and weight is the sum of the weight for each Var2 given the VarA value. In the simple example I gave, all the weights within a Var2 value are the same. In reality, the weights differ. For example, for Var2 2, the Var3 values might be 120 if VarA = 8 and 245 if Var4 = 11.

In my use case, Var1 is age, Var2 is school, and Var3 is the roll count for that school for that age.


Solution

  • I think your problem can be interpreted as a max bipartite matching problem, so you can use max_bipartite_match from igraph package to solve it

    library(igraph)
    f <- function(df) {
        # generate bipartite match
        g <- graph_from_data_frame(df) %>%
            set_vertex_attr(name = "type", value = names(V(.)) %in% df$VarA)
        # max bipartite match
        bm <- na.omit(max_bipartite_match(g)$matching)
        # retrieve match pattern and yield output
        v1 <- bm[bm %in% df$VarA]
        v2 <- names(v1)
        data.frame(
            Var1 = `class<-`(v1, class(df$VarA)),
            Var2 = `class<-`(v2, class(df$Var2))
        )
    }
    

    and you will see the output (as per your input TheDF, TheDF2 and TheDF3)

    > f(TheDF)
      Var1 Var2
    1    7    1
    2    8    2
    3   11    3
    4   14    4
    
    > f(TheDF2)
      Var1 Var2
    1    7    1
    2    8    2
    3   11    3
    4   14    4
    
    > f(TheDF3)
      Var1 Var2
    1    7    1
    2    8    2
    3   14    4