Search code examples
pythonlinear-programmingpulp

Python Pulp - 0/1 Knapsack Problem - How to Ensure Items (IDs) are Selected At Most Once


I am a little out of my element here: I am a statistician who mainly works in SAS and R. I am attempting to code what is essentially the classic knapsack problem in Python using Pulp. My data look something like this (this is a simplified version):

The fan.csv example data

id       Fan    Cost    Switch
10001   32.83   3.18    A/B
10003   75.25   4.20    C
10005   71.60   8.79    A/B
10010   24.23   9.99    C/D
10011   98.69   8.88    D
10012   48.81   8.68    C/D

I need to maximize Fan while choosing 1 or 2 items in each Switch category (e.g. 1 "A", 2 "Ds", etc.) all while coming in under a Cost threshold. So far, pretty standard.

My issue is that some items can be in one or the other Switch category (see obs with '/' in the Switch value - e.g., id 10001 can serve as either 'A' or 'B'). Because I am a statistician, I think in terms of n x k datasets and in this case I would set the data up in long form (as if these were repeated measures data):

id       Fan    Cost    Switch
10001   32.83   3.18    A
10001   32.83   3.18    B
10003   75.25   4.20    C
10005   71.60   8.79    A
10005   71.60   8.79    B
10010   24.23   9.99    C
10010   24.23   9.99    D
10011   98.69   8.88    D
10012   48.81   8.68    C
10012   48.81   8.68    D

HOWEVER, I have essentially taken someone else's example code to call the Pulp optimization model and I adapted it to my situation. My current problem is that it is unclear to me how to add constraints to the linear optimization model so that each item (each id) is selected at most just once. This is an issue because of the way I 'unpacked' the Switch variable and listed multiple id's with different Switch values.

My code appear below. In step 1, I convert the fan.csv data into long form (I wrote this code myself). Step 2 is mostly the example data that I adapted. The sample code organizes the data in nested dictionaries and then calls the Pulp functions to specify the optimization model.

import os
import pandas as pd
from pulp import *
import numpy as np

# Step 1

os.chdir("E:\\")

df01 = pd.read_csv("fan.csv")

df02 = df01[df01['Switch'].str.contains('/')]
df03 = df01[~df01['Switch'].str.contains('/')]

df02[['Swit1','Swit2']] = df02.Switch.str.split("/", expand=True)
df04 = pd.wide_to_long(df02, ["Swit"], i="id", j="Pos")
df04.reset_index(inplace=True)
df04.drop(["Pos"], axis=1, inplace=True)

df05 = pd.concat([df04,df03])
df06 = df05.sort_values(by=['id'])
df06.reset_index(drop=True, inplace=True)

df06.loc[(~df06['Switch'].str.contains('/')), 'Swit'] = df06.Switch

# Step 2    

df07 = df06.groupby(["Swit", "id", "Fan", "Cost"]).agg('count')
df07 = df07.reset_index()

costs = {}
fans = {}

for sth in df07.Swit.unique():
    df07_sth = df07[df07.Swit == sth]
    cost = list(df07_sth[['id', 'Cost']].set_index("id").to_dict().values())[0]
    fan = list(df07_sth[['id', 'Fan']].set_index("id").to_dict().values())[0]
    costs[sth] = cost
    fans[sth] = fan


sth_num_available = {
    "A": 1,
    "B": 1,
    "C": 1,
    "D": 2,
                    }

cost_cap = 25

_vars = {k: LpVariable.dict(k, v, cat='Binary') for k, v in fans.items()}

model1 = LpProblem("Fans", LpMaximize)
fanval = []
costval = []
switch_constraints = []

for k, v in _vars.items():
    costval += lpSum([costs[k][i] * _vars[k][i] for i in v])
    fanval += lpSum([fans[k][i] * _vars[k][i] for i in v])
    model1 += lpSum([_vars[k][i] for i in v]) == sth_num_available[k]

model1 += lpSum(fanval)
model1 += lpSum(costval) <= cost_cap
model1

model1.solve()
  • To be clear, the fan.csv data I listed above has so few rows of data that Pulp finds that the problem is infeasible. These data are just the example data for this post. In reality, I have ~1,000 rows of data and the data change and are updated every day.

EDIT 1

I have 2 additional constraints to add. I am including this in this post b/c this issue this related to the initial post.

The expanded fan.csv example data are:

id       Fan    Cost    Switch  Hit Group   Cat
10001   32.83   3.18    A/B       0   G1    C1
10003   75.25   4.20    C         1   G1    C2
10005   71.60   8.79    A/B       0   G1    C3
10010   24.23   9.99    C/D       1   G2    C1
10011   98.69   8.88    D         1   G2    C2
10012   48.81   8.68    C/D       1   G2    C3

New Constraint 1

The knapsack solution can include a maximum of n ids coded 1 for Hit from any one Group.

I tried adapting Reinderien's code for this constraint. Here n is set to 2:

df02 = df01.copy()
df02.set_index(["id","Group",], drop=False, inplace=True)
for G, total in df02.Assign.groupby(level='Group').sum().items():
    model1.addConstraint(name='Group' + G + 'Hit', constraint=df02.Hit.dot(total) <= 2)

But I get this error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\Programs\Python\Python311\Lib\site-packages\pandas\core\series.py", line 3015, in dot
    if lvals.shape[0] != rvals.shape[0]:
                         ~~~~~~~~~~~^^^
IndexError: tuple index out of range
>>>

New Constraint 2

Additionally, the knapsack solution must consist of at least 2 of the categories (Cat).

Here is my adapted code:

df04 = df02.copy()
df04.reset_index(drop=True, inplace=True)
df04.set_index(["id","Cat",], drop=False, inplace=True)
for category, total in df04.Assign.groupby(level='Cat').sum().items():
    model1.addConstraint(name='Cat_' + category, constraint=total >= 2)

This runs without error and BUT DOES NOT represent the intended constraint.

============================================


Solution

  • The other answer I linked uses scipy. Here I demonstrate PuLP because it's what you were using. Aside:

    [I mainly work] in SAS and R

    R has linear program support of its own, so why switch to Python? Anyway, the main gist of the demonstration below is that multi-indexing makes things easy:

    import numpy as np
    import pandas as pd
    import pulp
    
    df = (
        pd.read_fwf('fan.csv')
        .set_index('id')
        .rename(columns={'Switch': 'SwitchOptions'})
    )
    
    # arbitrary columns, id index, values are switch string fragments
    exploded_switch = df.SwitchOptions.str.split('/', expand=True)
    
    # A, B, C, D
    switch_values = pd.unique(exploded_switch.stack())
    
    # ABCD columns, id index, values are True or NaN to indicate switch match
    switch_sparse = pd.DataFrame(
        (
            exploded_switch.values[:, :, np.newaxis]
            == switch_values[np.newaxis, np.newaxis, :]
        ).any(axis=1),
        index=df.index,
        columns=pd.Index(name='Switch', data=switch_values),
    )
    switch_sparse[~switch_sparse] = np.nan
    
    # Multi-index to 10 ID/switch values
    df = switch_sparse.stack().align(df)[1]
    
    # Reasonable assignment variable names; will be unique
    var_names = (
        'asn_i'
        + df.index.get_level_values('id').astype(str)
        + '_s'
        + df.index.get_level_values('Switch')
    ).to_series(index=df.index)
    
    # Assignment variables, named, binary, with same ID/switch multi-index
    df['Assign'] = var_names.apply(pulp.LpVariable, cat=pulp.LpBinary)
    
    # Indexed total switch availability series
    switch_avail = pd.Series(
        index=switch_sparse.columns,
        data=(1, 1, 1, 2),
    )
    
    # Maximize total Fan via dot product
    model = pulp.LpProblem(name='Fans', sense=pulp.LpMaximize)
    model.objective = df.Assign.dot(df.Fan)
    model.addConstraint(name='cost_cap', constraint=df.Assign.dot(df.Cost) <= 25)
    
    # Constrain assignments by switch availability
    for switch, total in df.Assign.groupby(level='Switch').sum().items():
        model.addConstraint(name='switch_' + switch, constraint=total <= switch_avail[switch])
    
    # Constrain IDs to be assigned exclusively. Only add constraints where more than one switch is possible.
    id_groups = df.Assign.groupby(level='id')
    for ID, total in id_groups.sum()[id_groups.count() > 1].items():
        model.addConstraint(name=f'excl_id_{ID}', constraint=total <= 1)
    
    # include a maximum of n ids coded 1 for Hit from any one Group
    for group_name, total in df[df.Hit == 1].groupby('Group').Assign.sum().items():
        model.addConstraint(name=f'hit_group_{group_name}', constraint=total <= 2)
    
    cat_names = pd.Series(name='Cat', data=df.Cat.unique())
    cat_used = cat_names.apply(pulp.LpVariable, cat=pulp.LpBinary)
    cat_used.index = pd.Index(name='Cat', data=cat_names)
    
    # the knapsack solution must consist of at least 2 of the categories
    model.addConstraint(name='cats_used', constraint=cat_used.sum() >= 2)
    
    # if the ID is assigned, its category must be assigned
    for (ID, switch), row in df.iterrows():
        model.addConstraint(name=f'cat_i{ID}_s{switch}', constraint=row.Assign <= cat_used[row.Cat])
    
    # if none of the IDs in the category are assigned, the category must not be assigned
    for cat, total in df.groupby('Cat').Assign.sum().items():
        model.addConstraint(name=f'cat_{cat}', constraint=cat_used[cat] <= total)
    
    print(model)
    model.solve()
    assert model.status == pulp.LpStatusOptimal
    
    df['Assign'] = df.Assign.apply(pulp.LpVariable.value)
    print(df)
    
    Fans:
    MAXIMIZE
    32.83*asn_i10001_sA + 32.83*asn_i10001_sB + 75.25*asn_i10003_sC + 71.6*asn_i10005_sA + 71.6*asn_i10005_sB + 24.23*asn_i10010_sC + 24.23*asn_i10010_sD + 98.69*asn_i10011_sD + 48.81*asn_i10012_sC + 48.81*asn_i10012_sD + 0.0
    SUBJECT TO
    cost_cap: 3.18 asn_i10001_sA + 3.18 asn_i10001_sB + 4.2 asn_i10003_sC
     + 8.79 asn_i10005_sA + 8.79 asn_i10005_sB + 9.99 asn_i10010_sC
     + 9.99 asn_i10010_sD + 8.88 asn_i10011_sD + 8.68 asn_i10012_sC
     + 8.68 asn_i10012_sD <= 25
    
    switch_A: asn_i10001_sA + asn_i10005_sA <= 1
    
    switch_B: asn_i10001_sB + asn_i10005_sB <= 1
    
    switch_C: asn_i10003_sC + asn_i10010_sC + asn_i10012_sC <= 1
    
    switch_D: asn_i10010_sD + asn_i10011_sD + asn_i10012_sD <= 2
    
    excl_id_10001: asn_i10001_sA + asn_i10001_sB <= 1
    
    excl_id_10005: asn_i10005_sA + asn_i10005_sB <= 1
    
    excl_id_10010: asn_i10010_sC + asn_i10010_sD <= 1
    
    excl_id_10012: asn_i10012_sC + asn_i10012_sD <= 1
    
    hit_group_G1: asn_i10003_sC <= 2
    
    hit_group_G2: asn_i10010_sC + asn_i10010_sD + asn_i10011_sD + asn_i10012_sC
     + asn_i10012_sD <= 2
    
    cats_used: C1 + C2 + C3 >= 2
    
    cat_i10001_sA: - C1 + asn_i10001_sA <= 0
    
    cat_i10001_sB: - C1 + asn_i10001_sB <= 0
    
    cat_i10003_sC: - C2 + asn_i10003_sC <= 0
    
    cat_i10005_sA: - C3 + asn_i10005_sA <= 0
    
    cat_i10005_sB: - C3 + asn_i10005_sB <= 0
    
    cat_i10010_sC: - C1 + asn_i10010_sC <= 0
    
    cat_i10010_sD: - C1 + asn_i10010_sD <= 0
    
    cat_i10011_sD: - C2 + asn_i10011_sD <= 0
    
    cat_i10012_sC: - C3 + asn_i10012_sC <= 0
    
    cat_i10012_sD: - C3 + asn_i10012_sD <= 0
    
    cat_C1: C1 - asn_i10001_sA - asn_i10001_sB - asn_i10010_sC - asn_i10010_sD
     <= 0
    
    cat_C2: C2 - asn_i10003_sC - asn_i10011_sD <= 0
    
    cat_C3: C3 - asn_i10005_sA - asn_i10005_sB - asn_i10012_sC - asn_i10012_sD
     <= 0
    
    VARIABLES
    0 <= C1 <= 1 Integer
    0 <= C2 <= 1 Integer
    0 <= C3 <= 1 Integer
    0 <= asn_i10001_sA <= 1 Integer
    0 <= asn_i10001_sB <= 1 Integer
    0 <= asn_i10003_sC <= 1 Integer
    0 <= asn_i10005_sA <= 1 Integer
    0 <= asn_i10005_sB <= 1 Integer
    0 <= asn_i10010_sC <= 1 Integer
    0 <= asn_i10010_sD <= 1 Integer
    0 <= asn_i10011_sD <= 1 Integer
    0 <= asn_i10012_sC <= 1 Integer
    0 <= asn_i10012_sD <= 1 Integer
    
    Welcome to the CBC MILP Solver 
    Version: 2.10.3 
    Build Date: Dec 15 2019 
    
    At line 2 NAME          MODEL
    At line 3 ROWS
    At line 30 COLUMNS
    At line 137 RHS
    At line 163 BOUNDS
    At line 177 ENDATA
    Problem MODEL has 25 rows, 13 columns and 70 elements
    ...
    
    Result - Optimal solution found
    
    Objective value:                255.58000000
    Enumerated nodes:               0
    Total iterations:               0
    Time (CPU seconds):             0.00
    Time (Wallclock seconds):       0.01
    
    Option for printingOptions changed from normal to all
    Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01
    
                    Fan  Cost SwitchOptions  Hit Group Cat  Assign
    id    Switch                                                  
    10001 A       32.83  3.18           A/B    0    G1  C1     0.0
          B       32.83  3.18           A/B    0    G1  C1     1.0
    10003 C       75.25  4.20             C    1    G1  C2     1.0
    10005 A       71.60  8.79           A/B    0    G1  C3     0.0
          B       71.60  8.79           A/B    0    G1  C3     0.0
    10010 C       24.23  9.99           C/D    1    G2  C1     0.0
          D       24.23  9.99           C/D    1    G2  C1     0.0
    10011 D       98.69  8.88             D    1    G2  C2     1.0
    10012 C       48.81  8.68           C/D    1    G2  C3     0.0
          D       48.81  8.68           C/D    1    G2  C3     1.0