Tags: python, optimization, partitioning, linear-programming, pulp

Distribute a list of positive numbers into a desired number of sets, aiming to have sums as close as possible between them


My goal in posting this is to complement my other post here, where I was asked to provide a minimal working example, but also to ask for advice regarding the code itself, in case someone can suggest a better/faster/more reliable approach. Hence the separate post (its scope is different from that of the reproducibility post).

You can imagine several use cases of this concept.
E.g. you could have several files to process, each with a different number of records, and want to distribute their processing across N parallel lines, ensuring that each line has more or less the same total number of records to handle.
Or, you could have several objects of different weights, and N people to carry them from A to B, and want to ensure that each person has more or less the same weight to carry.

See below my code, based on linear programming (essentially an assignment problem coupled with the minimisation of the sum of absolute differences).
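
For clarity, this is the model the code builds (ad-hoc notation: a_c is the c-th number, x[c,m] = 1 if number c goes into set m, and T = sum(a)/S is the common target sum):

minimize    X_1 + X_2 + ... + X_S
subject to  sum_c a_c * x[c,m] - X_m <= T    for each set m
            sum_c a_c * x[c,m] + X_m >= T    for each set m
            sum_m x[c,m] == 1                for each number c
            x[c,m] in {0,1},  X_m >= 0

Because the objective pushes each X_m down, at the optimum X_m equals the absolute difference between the sum of set m and T.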

For some reason, when run with threads > 1, it gives different results on different runs, as described in the post linked above.
But it's much slower if I run it with a single thread, so it's a catch-22 situation.

Any advice for improvement / alternative approaches that still deliver an absolute optimal solution (not greedy) would be appreciated.

# Take a list of numbers and partition them into a desired number of sets,
# ensuring that the sum within each set is as close as possible
# to the sum of all numbers divided by the number of sets.

# USER SETTINGS
list_of_numbers = [  21614,   22716, 1344708,    8948,  136944,     819,    7109,
         255182,  556354, 1898763, 1239808,  925193,  173237,   64301,
         147896,  824564,   16028, 1021326,  108042,   72221,  368270,
          17467,    2953,   52942,    1855,  739627,  460833,   30955]
N_sets_to_make = 4

import numpy as np
import pulp
from pulp import *

# Calculate the desired size of each set
S = N_sets_to_make
average_size = sum(list_of_numbers) / S
sizes = [average_size] * S

# Create the coefficient matrix
N = len(list_of_numbers)
A = np.array(list_of_numbers, ndmin = 2)

# Create the pulp model
prob = LpProblem("Partitioning", LpMinimize)

# Create the pulp variables
# x_names : binary variables encoding the presence of each initial number in each final set
x_names = ['x_'+str(i) for i in range(N * S)]
x = [LpVariable(x_names[i], cat = 'Binary') for i in range(N * S)]
# X_names : continuous positive variables encoding the absolute difference between each final set sum and the desired size
X_names = ['X_'+str(i) for i in range(S)]
X = [LpVariable(X_names[i], lowBound = 0, cat = 'Continuous') for i in range(S)]

# Add the objective to the model (minimal sum of X_i)
prob += LpAffineExpression([(X[i], 1) for i in range(S) ])

# Add the constraints to the model

# Constraints forcing each initial number to be in one and only one final set
for c in range(N):
    prob += LpAffineExpression([(x[c+m*N],+1) for m in range(S)]) == 1

# Constraints forcing each final set to be non-empty
for m in range(S):
    prob += LpAffineExpression([(x[i],+1) for i in range(m*N,(m+1)*N)]) >= 1

# Constraints encoding the absolute values
# (the objective drives each X[m] down to exactly |set sum - target|)
cs = [c for c in range(N) if A[0,c] != 0]  # indices of non-zero numbers
for m in range(S):
    prob += LpAffineExpression([(x[c+m*N],A[0,c]) for c in cs]) - X[m] <= sizes[m]
    prob += LpAffineExpression([(x[c+m*N],A[0,c]) for c in cs]) + X[m] >= sizes[m]

# Solve the model
prob.solve(PULP_CBC_CMD(gapRel = 0, timeLimit = 3600, threads = 1))

# Extract the solution
values_of_x_i = [value(x[i]) for i in range(N * S)]
selected_ind_initial_numbers = [(list(range(N)) * S)[i] for i,l in enumerate(values_of_x_i) if l == 1]
selected_ind_final_sets = [(list((1 + np.repeat(range(S), N)).astype('int64')))[i] for i,l in enumerate(values_of_x_i) if l == 1]
ind_final_set_for_each_initial_number = [x for _, x in sorted(zip(selected_ind_initial_numbers, selected_ind_final_sets))]

# Find the numbers that ended up in each final set
d = dict()
for m, n in sorted(zip(ind_final_set_for_each_initial_number, list_of_numbers)) :
    if m in d :
        d[m].append(n)
    else :
        d[m] = [n]
print(d)

# And their sums
s = [sum(l) for l in d.values()]
print(s)

# And the absolute differences of their sums from the desired sum
absdiffs = [np.abs(s[i] - sizes[i]) for i in range(len(s))]
print(absdiffs)

# And the absolute fractional differences (relative to the total sum)
print([absdiffs[i]/sizes[i]/S for i in range(len(absdiffs))])

EDIT: modification of the code to use row-normalised versions of the desired sizes and of the coefficient matrix, which often results in faster execution (and makes the problem independent of the total sum of the numbers)

After:

sizes = [average_size] * S

add:

fracs = [1 / S] * S

After:

A = np.array(list_of_numbers, ndmin = 2)

add:

An = A / np.sum(A, axis = 1, keepdims = True)

Replace:

prob += LpAffineExpression([(x[c+m*N],A[0,c]) for c in cs]) - X[m] <= sizes[m]
prob += LpAffineExpression([(x[c+m*N],A[0,c]) for c in cs]) + X[m] >= sizes[m]

by:

prob += LpAffineExpression([(x[c+m*N],An[0,c]) for c in cs]) - X[m] <= fracs[m]
prob += LpAffineExpression([(x[c+m*N],An[0,c]) for c in cs]) + X[m] >= fracs[m]
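
To see that this is only a rescaling: dividing each original constraint by sum(A) turns it into

    sum_c An[0,c] * x[c+m*N] - X[m]/sum(A) <= 1/S
    sum_c An[0,c] * x[c+m*N] + X[m]/sum(A) >= 1/S

i.e. the normalised constraints with X[m] measured in units of the total sum. The optimal assignment is therefore the same; only the coefficient scale (and the objective value) changes.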

Also, unexpectedly (to me):

prob.solve(PULP_CBC_CMD(gapRel = 0, timeLimit = 3600, threads = None))

results in a much faster solution (reached well before the time limit) than when the number of threads is specified explicitly.


EDIT 2: results of running the code with the changes suggested by AirSquid, and with the new version of the code shown in the first edit.

Modified version of my code, using fracs and An instead of sizes and A, and threads = None: very fast run time (~10 s):

{1: [8948, 255182, 1021326, 1344708], 2: [2953, 7109, 17467, 64301, 72221, 108042, 136944, 556354, 739627, 925193], 3: [819, 1855, 21614, 173237, 368270, 824564, 1239808], 4: [16028, 22716, 30955, 52942, 147896, 460833, 1898763]}
[2630164, 2630211, 2630167, 2630133]
[4.75, 42.25, 1.75, 35.75]
[4.514919432450865e-07, 4.015902021495769e-06, 1.6633913698503185e-07, 3.3980709412656508e-06]

My original code, using sizes and A, but no X variables, as per AirSquid's suggestion, gapRel = 0.0001, threads = None: almost instant run time:

{1: [819, 2953, 64301, 72221, 739627, 824564, 925193], 2: [108042, 255182, 368270, 1898763], 3: [7109, 16028, 22716, 1239808, 1344708], 4: [1855, 8948, 17467, 21614, 30955, 52942, 136944, 147896, 173237, 460833, 556354, 1021326]}
[2629678, 2630257, 2630369, 2630371]
[490.75, 88.25, 200.25, 202.25]
[4.664624655737393e-05, 8.388245050816607e-06, 1.9033949817858644e-05, 1.922405168869868e-05]

Combination of the two (fracs and An, no X variables, gapRel = 0.0001, threads = None): almost instant run time:

{1: [819, 17467, 21614, 22716, 64301, 136944, 1021326, 1344708], 2: [1855, 7109, 8948, 30955, 460833, 556354, 739627, 824564], 3: [108042, 255182, 368270, 1898763], 4: [2953, 16028, 52942, 72221, 147896, 173237, 925193, 1239808]}
[2629895, 2630245, 2630257, 2630278]
[273.75, 76.25, 88.25, 109.25]
[2.6020193571229983e-05, 7.247633825776388e-06, 8.388245050816607e-06, 1.0384314694636988e-05]

Conclusion: if minimising max_sum instead of the individual sums of absolute differences always yields sums close to optimal for all sets (not only the largest one), this method could be very advantageous, especially when combined with the normalisation, provided one is prepared to accept a small nonzero relative gap.


Addendum after further testing

It turns out that the max_sum-based approach, even when left to run to completion (i.e. allowed to find an optimal solution, not stopped by the gap or time limit), does not result in the desired optimality.

Example:

list_of_numbers = [10000] + [100, 200, 300, 400, 500] * 5
N_sets_to_make = 3

With the original method (min of the sum of absolute differences), based on fractions (fracs, An):

# Final sets
{1: [10000],
 2: [200, 300, 300, 400, 500],
 3: [100, 100, 100, 100, 100, 200, 200, 200, 200, 300, 300, 300, 400, 400, 400, 400, 500, 500, 500, 500]}

# Their sums
[10000, 1700, 5800]

# Absolute differences
[4166.666666666667, 4133.333333333333, 33.33333333333303]

# Fractional differences
[0.23809523809523814, 0.23619047619047617, 0.0019047619047618874]

# Their sum
0.4761904761904762

With the max_sum method (min of the max sum of fractions):

# Final sets
{1: [10000],
 2: [500],
 3: [100, 100, 100, 100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300, 300, 400, 400, 400, 400, 400, 500, 500, 500, 500]}

# Their sums
[10000, 500, 7000]

# Absolute differences
[4166.666666666667, 5333.333333333333, 1166.666666666667]

# Fractional differences
[0.23809523809523814, 0.30476190476190473, 0.0666666666666667]

# Their sum
0.6095238095238096

In hindsight this makes sense: the minimax objective only penalises the largest set sum, which here is unavoidably 10000 (whichever set receives the number 10000 can never sum to less than that), so the solver has no incentive to balance the remaining sets among themselves.

Solution

  • First, realize that your toy problem is combinatorially quite vast.

    A very crude lower bound on the number of options is:

    c(28, 14) * c(14, 7) * c(7,6) * 1
    

    This exceeds 1e+11; the true number is likely several orders of magnitude larger.
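
    For reference, this bound can be evaluated directly, e.g. with Python's math.comb:

    from math import comb
    comb(28, 14) * comb(14, 7) * comb(7, 6)   # 963761198400, i.e. ~9.6e+11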

    One strategy to reduce the model size a bit is just to employ a minimax strategy, where you minimize the size of the largest set, which, depending on how you define "comparing the sums", is probably fine.

    This removes S-1 variables from your formulation and collapses the constraint matrix substantially. The absolute-value portion is reduced by half, and each of those constraints now involves the single continuous variable max_sum rather than one of S separate X variables.

    Do that, change the gapRel to something reasonable, and you get nearly instantaneous performance. With a gapRel of 0.0001, it solves instantly. If you must solve with a 100% optimality guarantee, you're still in for a long haul, simply because of the combinatorial magnitude of the problem and the underlying integer structure.

    I'd write this slightly differently, but not enough to worry about. Flat indexing and numpy aren't necessary: I'd just double-index the variables for clarity and skip the explicit LpAffineExpression calls, because the model constructs instantly either way and is easier to read (see my other pulp posts for examples, and the sketch after the code below).

    Slight mods

    # USER SETTINGS
    list_of_numbers = [  21614,   22716, 1344708,    8948,  136944,     819,    7109,
             255182,  556354, 1898763, 1239808,  925193,  173237,   64301,
             147896,  824564,   16028, 1021326,  108042,   72221,  368270,
              17467,    2953,   52942,    1855,  739627,  460833,   30955]
    N_sets_to_make = 4
    
    import numpy as np
    import pulp
    from pulp import *
    
    # Calculate the desired size of each set
    S = N_sets_to_make
    average_size = sum(list_of_numbers) / S
    sizes = [average_size] * S
    
    # Create the coefficient matrix
    N = len(list_of_numbers)
    A = np.array(list_of_numbers, ndmin = 2)
    
    # Create the pulp model
    prob = LpProblem("Partitioning", LpMinimize)
    
    # Create the pulp variables
    # x_names : binary variables encoding the presence of each initial number in each final set
    x_names = ['x_'+str(i) for i in range(N * S)]
    x = [LpVariable(x_names[i], cat = 'Binary') for i in range(N * S)]
    # X_names : continuous positive variables encoding the absolute difference between each final set sum and the desired size
    X_names = ['X_'+str(i) for i in range(S)]
    # X = [LpVariable(X_names[i], lowBound = 0, cat = 'Continuous') for i in range(S)]
    max_sum = LpVariable(name='max_sum', cat='Continuous')
    # Add the objective to the model (minimise max_sum, an upper bound on every set sum)
    prob += max_sum
    
    # Add the constraints to the model
    
    # Constraints forcing each initial number to be in one and only one final set
    for c in range(N):
        prob += LpAffineExpression([(x[c+m*N],+1) for m in range(S)]) == 1
    
    # Constraints forcing each final set to be non-empty
    for m in range(S):
        prob += LpAffineExpression([(x[i],+1) for i in range(m*N,(m+1)*N)]) >= 1
    
    # Constraints bounding each set sum from above by max_sum
    cs = [c for c in range(N) if A[0,c] != 0]  # indices of non-zero numbers
    for m in range(S):
        prob += LpAffineExpression([(x[c+m*N],A[0,c]) for c in cs]) <= max_sum
        # prob += LpAffineExpression([(x[c+m*N],A[0,c]) for c in cs]) + X[m] >= sizes[m]
    
    # Solve the model
    prob.solve(PULP_CBC_CMD(gapRel = 0.0001, timeLimit = 3600, threads = 1))
    
    # Extract the solution
    values_of_x_i = [value(x[i]) for i in range(N * S)]
    selected_ind_initial_numbers = [(list(range(N)) * S)[i] for i,l in enumerate(values_of_x_i) if l == 1]
    selected_ind_final_sets = [(list((1 + np.repeat(range(S), N)).astype('int64')))[i] for i,l in enumerate(values_of_x_i) if l == 1]
    ind_final_set_for_each_initial_number = [x for _, x in sorted(zip(selected_ind_initial_numbers, selected_ind_final_sets))]
    
    # Find the numbers that ended up in each final set
    d = dict()
    for m, n in sorted(zip(ind_final_set_for_each_initial_number, list_of_numbers)) :
        if m in d :
            d[m].append(n)
        else :
            d[m] = [n]
    print(d)
    
    # And their sums
    s = [sum(l) for l in d.values()]
    print(s)
    
    # And the absolute differences of their sums from the desired sum
    absdiffs = [np.abs(s[i] - sizes[i]) for i in range(len(s))]
    print(absdiffs)
    
    # And the absolute fractional differences (relative to the total sum)
    print([absdiffs[i]/sizes[i]/S for i in range(len(absdiffs))])
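
    For illustration, here is a minimal sketch of that double-indexed style applied to the minimax model (the variable names and the use of LpVariable.dicts / lpSum are my own choices, not taken from the code above):

    import pulp

    list_of_numbers = [21614, 22716, 1344708, 8948, 136944, 819, 7109,
                       255182, 556354, 1898763, 1239808, 925193, 173237, 64301,
                       147896, 824564, 16028, 1021326, 108042, 72221, 368270,
                       17467, 2953, 52942, 1855, 739627, 460833, 30955]
    S = 4
    N = len(list_of_numbers)

    prob = pulp.LpProblem("Partitioning", pulp.LpMinimize)

    # x[c, m] == 1 if number c goes into set m
    x = pulp.LpVariable.dicts("x", [(c, m) for c in range(N) for m in range(S)], cat="Binary")
    max_sum = pulp.LpVariable("max_sum")

    prob += max_sum  # objective: minimise the largest set sum

    for c in range(N):
        # each number must be assigned to exactly one set
        prob += pulp.lpSum(x[c, m] for m in range(S)) == 1

    for m in range(S):
        # each set must be non-empty
        prob += pulp.lpSum(x[c, m] for c in range(N)) >= 1
        # each set sum is bounded above by max_sum
        prob += pulp.lpSum(list_of_numbers[c] * x[c, m] for c in range(N)) <= max_sum

    prob.solve(pulp.PULP_CBC_CMD(gapRel = 0.0001, timeLimit = 3600))

    # recover the sets directly from the double index
    sets = {m + 1: [list_of_numbers[c] for c in range(N) if x[c, m].value() > 0.5] for m in range(S)}
    print(sets)
    print({m: sum(v) for m, v in sets.items()})

    The dictionary comprehension at the end replaces the flat-index bookkeeping used in the extraction code above.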