Imagine the following data (the code to reproduce all the outputs is at the end):
df
cars horsepower year safety
1 Toyota 140 2008 4
2 Chrysler 120 2009 4
3 Ford 140 2010 5
4 BMW 150 2008 3
5 Mercedes-Benz 150 2008 3
6 Hyundai 120 2009 4
7 Jaguar 150 2007 3
8 Tesla 120 2010 5
I'd like to swap the cars to get something like:
cars_initial cars_match horsepower year safety horsepowerMatch yearMatch safetyMatch
1 Toyota BMW 140 2008 4 150 2008 3
2 Tesla Chrysler 120 2010 5 120 2009 4
3 Mercedes-Benz Ford 150 2008 3 140 2010 5
4 Jaguar Hyundai 150 2007 3 120 2009 4
5 Hyundai Jaguar 120 2009 4 150 2007 3
6 Ford Mercedes-Benz 140 2010 5 150 2008 3
7 Chrysler Tesla 120 2009 4 120 2010 5
8 BMW Toyota 150 2008 3 140 2008 4
Now this is a typical assignment problem that was in the case above solved randomly, i.e. by cost matrix set to 0 in all cases.
What I'm interested in are the outcomes. In the above case, the solution yields the following stats:
stats
horsepower year safety
1 0.25 0.25 0
That is to say, 1/4 of swaps had an equal horsepower, etc.
Here is my question: How to solve such assignments by setting constraints on what exactly should be the outcome statistics directly, without the trial-and-error approach with setting the costs?
For instance, what if I would like to have a solution where safety
has more than 0.20 match, and year
at least 0.10, like below?
desiredOutput
cars_initial cars_match
1 Toyota Chrysler
2 Tesla Mercedes-Benz
3 Mercedes-Benz BMW
4 Jaguar Toyota
5 Hyundai Tesla
6 Ford Hyundai
7 Chrysler Jaguar
8 BMW Ford
statsDesired
horsepower year safety
1 0.25 0.12 0.25
Of course I could just set the cost matrix to a lower number in all cases where safety
of cars is equal.
But is there a way to influence the result by directly setting the constraint on what should be the outcome statistics?
Perhaps there is a way to optimize the costs in order to arrive at the desired result?
The code:
library(lpSolve)
library(dplyr)
library(tidyr)
set.seed(1)
df <- data.frame(
cars = c("Toyota", "Chrysler", "Ford", "BMW", "Mercedes-Benz", "Hyundai", "Jaguar", "Tesla"),
horsepower = c(140, 120, 140, 150, 150, 120, 150, 120),
year = c(2008, 2009, 2010, 2008, 2008, 2009, 2007, 2010),
safety = c(4, 4, 5, 3, 3, 4, 3, 5)
)
mat <- df %>% select(cars) %>%
crossing(df %>% select(cars)) %>%
mutate(val = 0) %>%
spread(cars, val)
solved <- lp.assign(mat %>% select(-cars1) %>% as.matrix())$solution
matches <- as.data.frame(solved) %>%
setNames(., names(mat %>% select(-cars1))) %>%
bind_cols(mat %>% select(cars1)) %>%
gather(key, val, -cars1) %>%
filter(val == 1) %>% select(-val, cars_initial = cars1, cars_match = key)
nms <- c("cars", paste0(names(df %>% select(-cars)), "Match"))
matches <- matches %>%
left_join(df, by = c("cars_initial" = "cars")) %>%
left_join(df %>% setNames(., nms), by = c("cars_match" = "cars"))
stats <- matches %>%
summarise(
horsepower = round(sum(horsepower == horsepowerMatch) / n(), 2),
year = round(sum(year == yearMatch) / n(), 2),
safety = round(sum(safety == safetyMatch) / n(), 2)
)
desiredOutput <- data.frame(cars_initial = matches$cars_initial, cars_match = c("Chrysler", "Mercedes-Benz", "BMW", "Toyota", "Tesla", "Hyundai", "Jaguar", "Ford"))
statsDesired <- desiredOutput %>%
left_join(df, by = c("cars_initial" = "cars")) %>%
left_join(df %>% setNames(., nms), by = c("cars_match" = "cars")) %>%
summarise(
horsepower = round(sum(horsepower == horsepowerMatch) / n(), 2),
year = round(sum(year == yearMatch) / n(), 2),
safety = round(sum(safety == safetyMatch) / n(), 2)
)
I hope the examples above are sufficient, this is my first question so please let me know if I need to provide something more.
The code is in R
, but I have also added the tag Python
as I don't really mind the language of possible solutions.
Here is a partial formulation of this problem as an integer programming (IP) problem.
Let I
be the set of car types. For car types i
and j
in I
, let:
h[i,j]
= 1 if cars i
and j
have the same horsepowery[i,j]
= 1 if cars i
and j
have the same years[i,j]
(safety)These are parameters, meaning inputs to your model. (You'll need to write code to calculate these binary quantities based on your data table.)
Now introduce the following decision variables, i.e., variables that your IP model will choose values of:
x[i,j]
= 1 if we assign car type j
as type i
's matchNow, normally an IP has an objective function that we want to minimize or maximize. In this case, there is no objective function -- you just want to find a set of matches that satisfies your constraints. So your objective function can just be:
minimize 0
Here is the first constraint. It says: At least a
of the matches must have the same horsepower. (a
is a fraction.) The left-hand side is the number of matches that have the same horsepower: For each pair of car types i
and j
, if j
is assigned as i
's match and they have the same horsepower, count a 1; otherwise, count a 0. The right-hand side is the number of matches you want, i.e., a
fraction of the whole set.
subject to sum {i in I, j in I} h[i,j] * x[i,j] >= a * |I|
Now formulate similar constraints for the other categories.
Next, you need a constraint that says each car type i
must be assigned to exactly one car type j
:
subject to sum {j in I} x[i,j] == 1 for all i in I
Finally, you need constraints saying that the decision variables are binary:
subject to x[i,j] in {0,1} for all i, j in I
Now, in terms of solving this thing, you will need to either use a mathematical modeling language like AMPL or GAMS, or a package like PuLP
for Python.
I hope this helps. I might have bitten off more than you can chew here.