Python Pulp - 0/1 Knapsack Problem - How to Ensure Items (IDs) are Selected At Most Once

I am a little out of my element here: I am a statistician who mainly works in SAS and R. I am attempting to code what is essentially the classic knapsack problem in Python using Pulp. My data look something like this (this is a simplified version):

The fan.csv example data

id       Fan    Cost    Switch
10001   32.83   3.18    A/B
10003   75.25   4.20    C
10005   71.60   8.79    A/B
10010   24.23   9.99    C/D
10011   98.69   8.88    D
10012   48.81   8.68    C/D

I need to maximize Fan while choosing 1 or 2 items in each Switch category (e.g. 1 "A", 2 "Ds", etc.) all while coming in under a Cost threshold. So far, pretty standard.

My issue is that some items can be in one or the other Switch category (see obs with '/' in the Switch value - e.g., id 10001 can serve as either 'A' or 'B'). Because I am a statistician, I think in terms of n x k datasets and in this case I would set the data up in long form (as if these were repeated measures data):

id       Fan    Cost    Switch
10001   32.83   3.18    A
10001   32.83   3.18    B
10003   75.25   4.20    C
10005   71.60   8.79    A
10005   71.60   8.79    B
10010   24.23   9.99    C
10010   24.23   9.99    D
10011   98.69   8.88    D
10012   48.81   8.68    C
10012   48.81   8.68    D

HOWEVER, I have essentially taken someone else's example code to call the Pulp optimization model and I adapted it to my situation. My current problem is that it is unclear to me how to add constraints to the linear optimization model so that each item (each id) is selected at most just once. This is an issue because of the way I 'unpacked' the Switch variable and listed multiple id's with different Switch values.

My code appear below. In step 1, I convert the fan.csv data into long form (I wrote this code myself). Step 2 is mostly the example data that I adapted. The sample code organizes the data in nested dictionaries and then calls the Pulp functions to specify the optimization model.

import os
import pandas as pd
from pulp import *
import numpy as np

# Step 1

os.chdir("E:\\")

df01 = pd.read_csv("fan.csv")

df02 = df01[df01['Switch'].str.contains('/')]
df03 = df01[~df01['Switch'].str.contains('/')]

df02[['Swit1','Swit2']] = df02.Switch.str.split("/", expand=True)
df04 = pd.wide_to_long(df02, ["Swit"], i="id", j="Pos")
df04.reset_index(inplace=True)
df04.drop(["Pos"], axis=1, inplace=True)

df05 = pd.concat([df04,df03])
df06 = df05.sort_values(by=['id'])
df06.reset_index(drop=True, inplace=True)

df06.loc[(~df06['Switch'].str.contains('/')), 'Swit'] = df06.Switch

# Step 2    

df07 = df06.groupby(["Swit", "id", "Fan", "Cost"]).agg('count')
df07 = df07.reset_index()

costs = {}
fans = {}

for sth in df07.Swit.unique():
    df07_sth = df07[df07.Swit == sth]
    cost = list(df07_sth[['id', 'Cost']].set_index("id").to_dict().values())[0]
    fan = list(df07_sth[['id', 'Fan']].set_index("id").to_dict().values())[0]
    costs[sth] = cost
    fans[sth] = fan


sth_num_available = {
    "A": 1,
    "B": 1,
    "C": 1,
    "D": 2,
                    }

cost_cap = 25

_vars = {k: LpVariable.dict(k, v, cat='Binary') for k, v in fans.items()}

model1 = LpProblem("Fans", LpMaximize)
fanval = []
costval = []
switch_constraints = []

for k, v in _vars.items():
    costval += lpSum([costs[k][i] * _vars[k][i] for i in v])
    fanval += lpSum([fans[k][i] * _vars[k][i] for i in v])
    model1 += lpSum([_vars[k][i] for i in v]) == sth_num_available[k]

model1 += lpSum(fanval)
model1 += lpSum(costval) <= cost_cap
model1

model1.solve()

To be clear, the fan.csv data I listed above has so few rows of data that Pulp finds that the problem is infeasible. These data are just the example data for this post. In reality, I have ~1,000 rows of data and the data change and are updated every day.

EDIT 1

I have 2 additional constraints to add. I am including this in this post b/c this issue this related to the initial post.

The expanded fan.csv example data are:

id       Fan    Cost    Switch  Hit Group   Cat
10001   32.83   3.18    A/B       0   G1    C1
10003   75.25   4.20    C         1   G1    C2
10005   71.60   8.79    A/B       0   G1    C3
10010   24.23   9.99    C/D       1   G2    C1
10011   98.69   8.88    D         1   G2    C2
10012   48.81   8.68    C/D       1   G2    C3

New Constraint 1

The knapsack solution can include a maximum of n ids coded 1 for Hit from any one Group.

I tried adapting Reinderien's code for this constraint. Here n is set to 2:

df02 = df01.copy()
df02.set_index(["id","Group",], drop=False, inplace=True)
for G, total in df02.Assign.groupby(level='Group').sum().items():
    model1.addConstraint(name='Group' + G + 'Hit', constraint=df02.Hit.dot(total) <= 2)

But I get this error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\Programs\Python\Python311\Lib\site-packages\pandas\core\series.py", line 3015, in dot
    if lvals.shape[0] != rvals.shape[0]:
                         ~~~~~~~~~~~^^^
IndexError: tuple index out of range
>>>

New Constraint 2

Additionally, the knapsack solution must consist of at least 2 of the categories (Cat).

Here is my adapted code:

df04 = df02.copy()
df04.reset_index(drop=True, inplace=True)
df04.set_index(["id","Cat",], drop=False, inplace=True)
for category, total in df04.Assign.groupby(level='Cat').sum().items():
    model1.addConstraint(name='Cat_' + category, constraint=total >= 2)

This runs without error and BUT DOES NOT represent the intended constraint.

============================================

Solution

The other answer I linked uses scipy. Here I demonstrate PuLP because it's what you were using. Aside:

[I mainly work] in SAS and R

R has linear program support of its own, so why switch to Python? Anyway, the main gist of the demonstration below is that multi-indexing makes things easy:

import numpy as np
import pandas as pd
import pulp

df = (
    pd.read_fwf('fan.csv')
    .set_index('id')
    .rename(columns={'Switch': 'SwitchOptions'})
)

# arbitrary columns, id index, values are switch string fragments
exploded_switch = df.SwitchOptions.str.split('/', expand=True)

# A, B, C, D
switch_values = pd.unique(exploded_switch.stack())

# ABCD columns, id index, values are True or NaN to indicate switch match
switch_sparse = pd.DataFrame(
    (
        exploded_switch.values[:, :, np.newaxis]
        == switch_values[np.newaxis, np.newaxis, :]
    ).any(axis=1),
    index=df.index,
    columns=pd.Index(name='Switch', data=switch_values),
)
switch_sparse[~switch_sparse] = np.nan

# Multi-index to 10 ID/switch values
df = switch_sparse.stack().align(df)[1]

# Reasonable assignment variable names; will be unique
var_names = (
    'asn_i'
    + df.index.get_level_values('id').astype(str)
    + '_s'
    + df.index.get_level_values('Switch')
).to_series(index=df.index)

# Assignment variables, named, binary, with same ID/switch multi-index
df['Assign'] = var_names.apply(pulp.LpVariable, cat=pulp.LpBinary)

# Indexed total switch availability series
switch_avail = pd.Series(
    index=switch_sparse.columns,
    data=(1, 1, 1, 2),
)

# Maximize total Fan via dot product
model = pulp.LpProblem(name='Fans', sense=pulp.LpMaximize)
model.objective = df.Assign.dot(df.Fan)
model.addConstraint(name='cost_cap', constraint=df.Assign.dot(df.Cost) <= 25)

# Constrain assignments by switch availability
for switch, total in df.Assign.groupby(level='Switch').sum().items():
    model.addConstraint(name='switch_' + switch, constraint=total <= switch_avail[switch])

# Constrain IDs to be assigned exclusively. Only add constraints where more than one switch is possible.
id_groups = df.Assign.groupby(level='id')
for ID, total in id_groups.sum()[id_groups.count() > 1].items():
    model.addConstraint(name=f'excl_id_{ID}', constraint=total <= 1)

# include a maximum of n ids coded 1 for Hit from any one Group
for group_name, total in df[df.Hit == 1].groupby('Group').Assign.sum().items():
    model.addConstraint(name=f'hit_group_{group_name}', constraint=total <= 2)

cat_names = pd.Series(name='Cat', data=df.Cat.unique())
cat_used = cat_names.apply(pulp.LpVariable, cat=pulp.LpBinary)
cat_used.index = pd.Index(name='Cat', data=cat_names)

# the knapsack solution must consist of at least 2 of the categories
model.addConstraint(name='cats_used', constraint=cat_used.sum() >= 2)

# if the ID is assigned, its category must be assigned
for (ID, switch), row in df.iterrows():
    model.addConstraint(name=f'cat_i{ID}_s{switch}', constraint=row.Assign <= cat_used[row.Cat])

# if none of the IDs in the category are assigned, the category must not be assigned
for cat, total in df.groupby('Cat').Assign.sum().items():
    model.addConstraint(name=f'cat_{cat}', constraint=cat_used[cat] <= total)

print(model)
model.solve()
assert model.status == pulp.LpStatusOptimal

df['Assign'] = df.Assign.apply(pulp.LpVariable.value)
print(df)

Fans:
MAXIMIZE
32.83*asn_i10001_sA + 32.83*asn_i10001_sB + 75.25*asn_i10003_sC + 71.6*asn_i10005_sA + 71.6*asn_i10005_sB + 24.23*asn_i10010_sC + 24.23*asn_i10010_sD + 98.69*asn_i10011_sD + 48.81*asn_i10012_sC + 48.81*asn_i10012_sD + 0.0
SUBJECT TO
cost_cap: 3.18 asn_i10001_sA + 3.18 asn_i10001_sB + 4.2 asn_i10003_sC
 + 8.79 asn_i10005_sA + 8.79 asn_i10005_sB + 9.99 asn_i10010_sC
 + 9.99 asn_i10010_sD + 8.88 asn_i10011_sD + 8.68 asn_i10012_sC
 + 8.68 asn_i10012_sD <= 25

switch_A: asn_i10001_sA + asn_i10005_sA <= 1

switch_B: asn_i10001_sB + asn_i10005_sB <= 1

switch_C: asn_i10003_sC + asn_i10010_sC + asn_i10012_sC <= 1

switch_D: asn_i10010_sD + asn_i10011_sD + asn_i10012_sD <= 2

excl_id_10001: asn_i10001_sA + asn_i10001_sB <= 1

excl_id_10005: asn_i10005_sA + asn_i10005_sB <= 1

excl_id_10010: asn_i10010_sC + asn_i10010_sD <= 1

excl_id_10012: asn_i10012_sC + asn_i10012_sD <= 1

hit_group_G1: asn_i10003_sC <= 2

hit_group_G2: asn_i10010_sC + asn_i10010_sD + asn_i10011_sD + asn_i10012_sC
 + asn_i10012_sD <= 2

cats_used: C1 + C2 + C3 >= 2

cat_i10001_sA: - C1 + asn_i10001_sA <= 0

cat_i10001_sB: - C1 + asn_i10001_sB <= 0

cat_i10003_sC: - C2 + asn_i10003_sC <= 0

cat_i10005_sA: - C3 + asn_i10005_sA <= 0

cat_i10005_sB: - C3 + asn_i10005_sB <= 0

cat_i10010_sC: - C1 + asn_i10010_sC <= 0

cat_i10010_sD: - C1 + asn_i10010_sD <= 0

cat_i10011_sD: - C2 + asn_i10011_sD <= 0

cat_i10012_sC: - C3 + asn_i10012_sC <= 0

cat_i10012_sD: - C3 + asn_i10012_sD <= 0

cat_C1: C1 - asn_i10001_sA - asn_i10001_sB - asn_i10010_sC - asn_i10010_sD
 <= 0

cat_C2: C2 - asn_i10003_sC - asn_i10011_sD <= 0

cat_C3: C3 - asn_i10005_sA - asn_i10005_sB - asn_i10012_sC - asn_i10012_sD
 <= 0

VARIABLES
0 <= C1 <= 1 Integer
0 <= C2 <= 1 Integer
0 <= C3 <= 1 Integer
0 <= asn_i10001_sA <= 1 Integer
0 <= asn_i10001_sB <= 1 Integer
0 <= asn_i10003_sC <= 1 Integer
0 <= asn_i10005_sA <= 1 Integer
0 <= asn_i10005_sB <= 1 Integer
0 <= asn_i10010_sC <= 1 Integer
0 <= asn_i10010_sD <= 1 Integer
0 <= asn_i10011_sD <= 1 Integer
0 <= asn_i10012_sC <= 1 Integer
0 <= asn_i10012_sD <= 1 Integer

Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

At line 2 NAME          MODEL
At line 3 ROWS
At line 30 COLUMNS
At line 137 RHS
At line 163 BOUNDS
At line 177 ENDATA
Problem MODEL has 25 rows, 13 columns and 70 elements
...

Result - Optimal solution found

Objective value:                255.58000000
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.01

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01

                Fan  Cost SwitchOptions  Hit Group Cat  Assign
id    Switch                                                  
10001 A       32.83  3.18           A/B    0    G1  C1     0.0
      B       32.83  3.18           A/B    0    G1  C1     1.0
10003 C       75.25  4.20             C    1    G1  C2     1.0
10005 A       71.60  8.79           A/B    0    G1  C3     0.0
      B       71.60  8.79           A/B    0    G1  C3     0.0
10010 C       24.23  9.99           C/D    1    G2  C1     0.0
      D       24.23  9.99           C/D    1    G2  C1     0.0
10011 D       98.69  8.88             D    1    G2  C2     1.0
10012 C       48.81  8.68           C/D    1    G2  C3     0.0
      D       48.81  8.68           C/D    1    G2  C3     1.0