I am a little out of my element here: I am a statistician who mainly works in SAS and R. I am attempting to code what is essentially the classic knapsack problem in Python using Pulp. My data look something like this (this is a simplified version):
The fan.csv example data
id Fan Cost Switch
10001 32.83 3.18 A/B
10003 75.25 4.20 C
10005 71.60 8.79 A/B
10010 24.23 9.99 C/D
10011 98.69 8.88 D
10012 48.81 8.68 C/D
I need to maximize Fan while choosing 1 or 2 items in each Switch category (e.g. 1 "A", 2 "Ds", etc.) all while coming in under a Cost threshold. So far, pretty standard.
My issue is that some items can be in one or the other Switch category (see obs with '/' in the Switch value - e.g., id 10001 can serve as either 'A' or 'B'). Because I am a statistician, I think in terms of n x k datasets and in this case I would set the data up in long form (as if these were repeated measures data):
id Fan Cost Switch
10001 32.83 3.18 A
10001 32.83 3.18 B
10003 75.25 4.20 C
10005 71.60 8.79 A
10005 71.60 8.79 B
10010 24.23 9.99 C
10010 24.23 9.99 D
10011 98.69 8.88 D
10012 48.81 8.68 C
10012 48.81 8.68 D
HOWEVER, I have essentially taken someone else's example code to call the Pulp optimization model and I adapted it to my situation. My current problem is that it is unclear to me how to add constraints to the linear optimization model so that each item (each id) is selected at most just once. This is an issue because of the way I 'unpacked' the Switch variable and listed multiple id's with different Switch values.
My code appear below. In step 1, I convert the fan.csv data into long form (I wrote this code myself). Step 2 is mostly the example data that I adapted. The sample code organizes the data in nested dictionaries and then calls the Pulp functions to specify the optimization model.
import os
import pandas as pd
from pulp import *
import numpy as np
# Step 1
os.chdir("E:\\")
df01 = pd.read_csv("fan.csv")
df02 = df01[df01['Switch'].str.contains('/')]
df03 = df01[~df01['Switch'].str.contains('/')]
df02[['Swit1','Swit2']] = df02.Switch.str.split("/", expand=True)
df04 = pd.wide_to_long(df02, ["Swit"], i="id", j="Pos")
df04.reset_index(inplace=True)
df04.drop(["Pos"], axis=1, inplace=True)
df05 = pd.concat([df04,df03])
df06 = df05.sort_values(by=['id'])
df06.reset_index(drop=True, inplace=True)
df06.loc[(~df06['Switch'].str.contains('/')), 'Swit'] = df06.Switch
# Step 2
df07 = df06.groupby(["Swit", "id", "Fan", "Cost"]).agg('count')
df07 = df07.reset_index()
costs = {}
fans = {}
for sth in df07.Swit.unique():
df07_sth = df07[df07.Swit == sth]
cost = list(df07_sth[['id', 'Cost']].set_index("id").to_dict().values())[0]
fan = list(df07_sth[['id', 'Fan']].set_index("id").to_dict().values())[0]
costs[sth] = cost
fans[sth] = fan
sth_num_available = {
"A": 1,
"B": 1,
"C": 1,
"D": 2,
}
cost_cap = 25
_vars = {k: LpVariable.dict(k, v, cat='Binary') for k, v in fans.items()}
model1 = LpProblem("Fans", LpMaximize)
fanval = []
costval = []
switch_constraints = []
for k, v in _vars.items():
costval += lpSum([costs[k][i] * _vars[k][i] for i in v])
fanval += lpSum([fans[k][i] * _vars[k][i] for i in v])
model1 += lpSum([_vars[k][i] for i in v]) == sth_num_available[k]
model1 += lpSum(fanval)
model1 += lpSum(costval) <= cost_cap
model1
model1.solve()
I have 2 additional constraints to add. I am including this in this post b/c this issue this related to the initial post.
The expanded fan.csv example data are:
id Fan Cost Switch Hit Group Cat
10001 32.83 3.18 A/B 0 G1 C1
10003 75.25 4.20 C 1 G1 C2
10005 71.60 8.79 A/B 0 G1 C3
10010 24.23 9.99 C/D 1 G2 C1
10011 98.69 8.88 D 1 G2 C2
10012 48.81 8.68 C/D 1 G2 C3
The knapsack solution can include a maximum of n ids coded 1 for Hit from any one Group.
I tried adapting Reinderien's code for this constraint. Here n is set to 2:
df02 = df01.copy()
df02.set_index(["id","Group",], drop=False, inplace=True)
for G, total in df02.Assign.groupby(level='Group').sum().items():
model1.addConstraint(name='Group' + G + 'Hit', constraint=df02.Hit.dot(total) <= 2)
But I get this error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Users\Programs\Python\Python311\Lib\site-packages\pandas\core\series.py", line 3015, in dot
if lvals.shape[0] != rvals.shape[0]:
~~~~~~~~~~~^^^
IndexError: tuple index out of range
>>>
Additionally, the knapsack solution must consist of at least 2 of the categories (Cat).
Here is my adapted code:
df04 = df02.copy()
df04.reset_index(drop=True, inplace=True)
df04.set_index(["id","Cat",], drop=False, inplace=True)
for category, total in df04.Assign.groupby(level='Cat').sum().items():
model1.addConstraint(name='Cat_' + category, constraint=total >= 2)
This runs without error and BUT DOES NOT represent the intended constraint.
============================================
The other answer I linked uses scipy. Here I demonstrate PuLP because it's what you were using. Aside:
[I mainly work] in SAS and R
R has linear program support of its own, so why switch to Python? Anyway, the main gist of the demonstration below is that multi-indexing makes things easy:
import numpy as np
import pandas as pd
import pulp
df = (
pd.read_fwf('fan.csv')
.set_index('id')
.rename(columns={'Switch': 'SwitchOptions'})
)
# arbitrary columns, id index, values are switch string fragments
exploded_switch = df.SwitchOptions.str.split('/', expand=True)
# A, B, C, D
switch_values = pd.unique(exploded_switch.stack())
# ABCD columns, id index, values are True or NaN to indicate switch match
switch_sparse = pd.DataFrame(
(
exploded_switch.values[:, :, np.newaxis]
== switch_values[np.newaxis, np.newaxis, :]
).any(axis=1),
index=df.index,
columns=pd.Index(name='Switch', data=switch_values),
)
switch_sparse[~switch_sparse] = np.nan
# Multi-index to 10 ID/switch values
df = switch_sparse.stack().align(df)[1]
# Reasonable assignment variable names; will be unique
var_names = (
'asn_i'
+ df.index.get_level_values('id').astype(str)
+ '_s'
+ df.index.get_level_values('Switch')
).to_series(index=df.index)
# Assignment variables, named, binary, with same ID/switch multi-index
df['Assign'] = var_names.apply(pulp.LpVariable, cat=pulp.LpBinary)
# Indexed total switch availability series
switch_avail = pd.Series(
index=switch_sparse.columns,
data=(1, 1, 1, 2),
)
# Maximize total Fan via dot product
model = pulp.LpProblem(name='Fans', sense=pulp.LpMaximize)
model.objective = df.Assign.dot(df.Fan)
model.addConstraint(name='cost_cap', constraint=df.Assign.dot(df.Cost) <= 25)
# Constrain assignments by switch availability
for switch, total in df.Assign.groupby(level='Switch').sum().items():
model.addConstraint(name='switch_' + switch, constraint=total <= switch_avail[switch])
# Constrain IDs to be assigned exclusively. Only add constraints where more than one switch is possible.
id_groups = df.Assign.groupby(level='id')
for ID, total in id_groups.sum()[id_groups.count() > 1].items():
model.addConstraint(name=f'excl_id_{ID}', constraint=total <= 1)
# include a maximum of n ids coded 1 for Hit from any one Group
for group_name, total in df[df.Hit == 1].groupby('Group').Assign.sum().items():
model.addConstraint(name=f'hit_group_{group_name}', constraint=total <= 2)
cat_names = pd.Series(name='Cat', data=df.Cat.unique())
cat_used = cat_names.apply(pulp.LpVariable, cat=pulp.LpBinary)
cat_used.index = pd.Index(name='Cat', data=cat_names)
# the knapsack solution must consist of at least 2 of the categories
model.addConstraint(name='cats_used', constraint=cat_used.sum() >= 2)
# if the ID is assigned, its category must be assigned
for (ID, switch), row in df.iterrows():
model.addConstraint(name=f'cat_i{ID}_s{switch}', constraint=row.Assign <= cat_used[row.Cat])
# if none of the IDs in the category are assigned, the category must not be assigned
for cat, total in df.groupby('Cat').Assign.sum().items():
model.addConstraint(name=f'cat_{cat}', constraint=cat_used[cat] <= total)
print(model)
model.solve()
assert model.status == pulp.LpStatusOptimal
df['Assign'] = df.Assign.apply(pulp.LpVariable.value)
print(df)
Fans:
MAXIMIZE
32.83*asn_i10001_sA + 32.83*asn_i10001_sB + 75.25*asn_i10003_sC + 71.6*asn_i10005_sA + 71.6*asn_i10005_sB + 24.23*asn_i10010_sC + 24.23*asn_i10010_sD + 98.69*asn_i10011_sD + 48.81*asn_i10012_sC + 48.81*asn_i10012_sD + 0.0
SUBJECT TO
cost_cap: 3.18 asn_i10001_sA + 3.18 asn_i10001_sB + 4.2 asn_i10003_sC
+ 8.79 asn_i10005_sA + 8.79 asn_i10005_sB + 9.99 asn_i10010_sC
+ 9.99 asn_i10010_sD + 8.88 asn_i10011_sD + 8.68 asn_i10012_sC
+ 8.68 asn_i10012_sD <= 25
switch_A: asn_i10001_sA + asn_i10005_sA <= 1
switch_B: asn_i10001_sB + asn_i10005_sB <= 1
switch_C: asn_i10003_sC + asn_i10010_sC + asn_i10012_sC <= 1
switch_D: asn_i10010_sD + asn_i10011_sD + asn_i10012_sD <= 2
excl_id_10001: asn_i10001_sA + asn_i10001_sB <= 1
excl_id_10005: asn_i10005_sA + asn_i10005_sB <= 1
excl_id_10010: asn_i10010_sC + asn_i10010_sD <= 1
excl_id_10012: asn_i10012_sC + asn_i10012_sD <= 1
hit_group_G1: asn_i10003_sC <= 2
hit_group_G2: asn_i10010_sC + asn_i10010_sD + asn_i10011_sD + asn_i10012_sC
+ asn_i10012_sD <= 2
cats_used: C1 + C2 + C3 >= 2
cat_i10001_sA: - C1 + asn_i10001_sA <= 0
cat_i10001_sB: - C1 + asn_i10001_sB <= 0
cat_i10003_sC: - C2 + asn_i10003_sC <= 0
cat_i10005_sA: - C3 + asn_i10005_sA <= 0
cat_i10005_sB: - C3 + asn_i10005_sB <= 0
cat_i10010_sC: - C1 + asn_i10010_sC <= 0
cat_i10010_sD: - C1 + asn_i10010_sD <= 0
cat_i10011_sD: - C2 + asn_i10011_sD <= 0
cat_i10012_sC: - C3 + asn_i10012_sC <= 0
cat_i10012_sD: - C3 + asn_i10012_sD <= 0
cat_C1: C1 - asn_i10001_sA - asn_i10001_sB - asn_i10010_sC - asn_i10010_sD
<= 0
cat_C2: C2 - asn_i10003_sC - asn_i10011_sD <= 0
cat_C3: C3 - asn_i10005_sA - asn_i10005_sB - asn_i10012_sC - asn_i10012_sD
<= 0
VARIABLES
0 <= C1 <= 1 Integer
0 <= C2 <= 1 Integer
0 <= C3 <= 1 Integer
0 <= asn_i10001_sA <= 1 Integer
0 <= asn_i10001_sB <= 1 Integer
0 <= asn_i10003_sC <= 1 Integer
0 <= asn_i10005_sA <= 1 Integer
0 <= asn_i10005_sB <= 1 Integer
0 <= asn_i10010_sC <= 1 Integer
0 <= asn_i10010_sD <= 1 Integer
0 <= asn_i10011_sD <= 1 Integer
0 <= asn_i10012_sC <= 1 Integer
0 <= asn_i10012_sD <= 1 Integer
Welcome to the CBC MILP Solver
Version: 2.10.3
Build Date: Dec 15 2019
At line 2 NAME MODEL
At line 3 ROWS
At line 30 COLUMNS
At line 137 RHS
At line 163 BOUNDS
At line 177 ENDATA
Problem MODEL has 25 rows, 13 columns and 70 elements
...
Result - Optimal solution found
Objective value: 255.58000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.01
Option for printingOptions changed from normal to all
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.01
Fan Cost SwitchOptions Hit Group Cat Assign
id Switch
10001 A 32.83 3.18 A/B 0 G1 C1 0.0
B 32.83 3.18 A/B 0 G1 C1 1.0
10003 C 75.25 4.20 C 1 G1 C2 1.0
10005 A 71.60 8.79 A/B 0 G1 C3 0.0
B 71.60 8.79 A/B 0 G1 C3 0.0
10010 C 24.23 9.99 C/D 1 G2 C1 0.0
D 24.23 9.99 C/D 1 G2 C1 0.0
10011 D 98.69 8.88 D 1 G2 C2 1.0
10012 C 48.81 8.68 C/D 1 G2 C3 0.0
D 48.81 8.68 C/D 1 G2 C3 1.0