I have a data frame similar to
TheDF <- data.frame(VarA=rep(c(7, 8, 11, 14), 4), Var2=c(1,1,1,0, 0,2,2,0, 0,0,3,0, 0,0,4,4),
Var3=c(50, 50, 50, 50, 100, 100, 100, 100, 150, 150, 150, 150, 200, 200, 200, 200))
TheDF <- TheDF %>%
filter(Var2>0)
I need a solution that selects the combination of rows such that each value of Var1
and Var2
can only be selected once.
The result would look like (in this case there is only one solution)
Var1 Var2
7 1
8 2
11 3
14 4
This is for any number of Var1
and Var2
. If there is a non-unique solution such as:
TheDF2 <- data.frame(VarA=rep(c(7, 8, 11, 14), 5), Var2=c(1,1,1,0, 0,2,2,0, 0,0,3,0, 0,0,4,4, 0,0,5,5),
Var3=c(50, 50, 50, 50, 100, 100, 100, 100, 150, 150, 150, 150, 200, 200, 200, 200, 250, 250, 250, 250))
TheDF2 <- TheDF2 %>%
filter(Var2>0)
where Var2
values of both 4 and 5 can satisfy the requirement for uniqueness, the selected result should be based on Var3
. The selected Var2
value would be 5, because the Var3
value for 5 is 250, compared to 200 for Var4
, giving:
Var1 Var2
7 1
8 2
11 3
14 5
If both Var3
values were the same, it would be a random choice.
If it was impossible to get a unique solution for one pair, then just return the ones that were unique. E.g.
TheDF3 <- data.frame(VarA=rep(c(7, 8, 11, 14), 4), Var2=c(1,0,0,0, 0,2,2,0, 3,0,0,0, 0,0,0,4),
Var3=c(50, 50, 50, 50, 100, 100, 100, 100, 150, 150, 150, 150, 200, 200, 200, 200))
TheDF3 <- TheDF3 %>%
filter(Var2>0)
One solution would be, using the Var3
weights:
Var1 Var2
7 3
8 2
14 4
The other possible solution is:
Var1 Var2
7 3
11 2
14 4
because the sum of the weights for both solutions is the same.
Edit: I am now leaning towards a solution similar to combination method to get every combination of Var1
and Var2
, with the value of Var1
as the column headings and the value of Var2
as the cell values. With a new variable as the sum of the Var3
in the row, and another new variable that counts the number of duplicates in the row. I don't know how to implement this either.
In the current example I am working on IRL, the number of combinations is 15,972.
Edit 2: desired combination output would look like (using DF1):
7 8 11 14 Duplicates Weight
[1] 1 1 1 4 2 350
[2] 1 1 2 4 1 400
[3] 1 1 3 4 1 450
[4] 1 1 4 4 2 500
[5] 1 2 1 4 1 400
...
[n] 1 2 4 4 1 550
with VarA
turning into the column headers, and each row is each combination of Var2
values. Duplicates
is the number of duplicated Var2
values in the row, and weight is the sum of the weight for each Var2
given the VarA
value. In the simple example I gave, all the weights within a Var2
value are the same. In reality, the weights differ. For example, for Var2
2, the Var3
values might be 120 if VarA
= 8 and 245 if Var4
= 11.
In my use case, Var1
is age, Var2
is school, and Var3
is the roll count for that school for that age.
I think your problem can be interpreted as a max bipartite matching problem, so you can use max_bipartite_match
from igraph
package to solve it
library(igraph)
f <- function(df) {
# generate bipartite match
g <- graph_from_data_frame(df) %>%
set_vertex_attr(name = "type", value = names(V(.)) %in% df$VarA)
# max bipartite match
bm <- na.omit(max_bipartite_match(g)$matching)
# retrieve match pattern and yield output
v1 <- bm[bm %in% df$VarA]
v2 <- names(v1)
data.frame(
Var1 = `class<-`(v1, class(df$VarA)),
Var2 = `class<-`(v2, class(df$Var2))
)
}
and you will see the output (as per your input TheDF
, TheDF2
and TheDF3
)
> f(TheDF)
Var1 Var2
1 7 1
2 8 2
3 11 3
4 14 4
> f(TheDF2)
Var1 Var2
1 7 1
2 8 2
3 11 3
4 14 4
> f(TheDF3)
Var1 Var2
1 7 1
2 8 2
3 14 4