I am having trouble designing a fitness function for pyswarms that will actually iterate through the particles. I am basing my design on this (working) example code:
# import modules
import numpy as np
# create a parameterized version of the classic Rosenbrock unconstrained optimization function
def rosenbrock_with_args(x, a, b, c=0):
    f = (a - x[:, 0]) ** 2 + b * (x[:, 1] - x[:, 0] ** 2) ** 2 + c
    return f
from pyswarms.single.global_best import GlobalBestPSO
# instantiate the optimizer
x_max = 10 * np.ones(2)
x_min = -1 * x_max
bounds = (x_min, x_max)
options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9}
optimizer = GlobalBestPSO(n_particles=10, dimensions=2, options=options, bounds=bounds)
# now run the optimization, pass a=1 and b=100 as a tuple assigned to args
cost, pos = optimizer.optimize(rosenbrock_with_args, 1000, a=1, b=100, c=0)
kwargs={"a": 1.0, "b": 100.0, 'c':0}
It seems that writing x[:, 0] and x[:, 1] somehow parametrizes the particle position matrix for the optimization function. For example, executing x[:, 0] in the debugger returns:
array([ 9.19955426, -5.31471451, -2.28507312, -2.53652044, -6.29916204,
-8.44170591, 7.80464884, -6.42048159, 9.77440842, -9.06991295])
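In other words, x appears to be the whole swarm position matrix, with one row per particle. A minimal sketch of the same structure (random values standing in for real particle positions):

import numpy as np
# shape (n_particles, dimensions): 10 particles in 2 dimensions, as above
x = np.random.uniform(-10, 10, (10, 2))
print(x[:, 0])  # the first coordinate of all 10 particles at once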
Now, jumping to (a snippet from) my code, I have this:
def optimize_eps_and_mp(x):
    clusterer = DBSCAN(eps=x[:, 0], min_samples=x[:, 1], metric="precomputed")
    clusterer.fit(distance_matrix)
    clusters = pd.DataFrame.from_dict({index_to_gid[i[0]]: [i[1]] for i in enumerate(clusterer.labels_)},
                                      orient="index", columns=["cluster"])
    settlements_clustered = settlements.join(clusters)
    cluster_pops = settlements_clustered.loc[settlements_clustered["cluster"] >= 0].groupby(["cluster"]).sum()["pop_sum"].to_list()
    print()
    return 1
options = {'c1': 0.5, 'c2': 0.3, 'w':0.9}
max_bound = [1000, 10]
min_bound = [1, 2]
bounds = (min_bound, max_bound)
n_particles = 10
optimizer = ps.single.GlobalBestPSO(n_particles=n_particles, dimensions=2, options=options, bounds=bounds)
cost, pos = optimizer.optimize(optimize_eps_and_mp, iters=1000)
(The variables distance_matrix and settlements are defined earlier in the code, but it is failing on the line clusterer = DBSCAN(eps=x[:, 0], min_samples=x[:, 1], metric="precomputed"), so they are not relevant. Also, I am aware that it always returns 1; I am just trying to get it to run without errors before finishing the function.)
When I execute x[:, 0] in the debugger, it returns:
array([-4.54925788, 3.94338766, 0.97085618, 9.44128746, -2.1932764 ,
9.24640763, 9.18286758, -8.91052863, 0.637599 , -2.28228841])
So, identical in structure to the working example. But it fails on the line clusterer = DBSCAN(eps=x[:, 0], min_samples=x[:, 1], metric="precomputed") because the entire contents of x[:, 0] are passed to DBSCAN, rather than being parameterized as in the working example.
Is there some difference between these examples that I am just not seeing?
I have also tried pasting the fitness function from the working example (rosenbrock_with_args) into my code and optimizing that instead, to rule out any problem with the way my implementation is set up. The solution then converges as normal, so I am completely out of ideas as to why it does not work with my function (optimize_eps_and_mp).
The exact stack trace that I get refers to an error in the DBSCAN algorithm, which I assume is caused by DBSCAN somehow being passed the entire set of particle swarm values rather than individual values:
pyswarms.single.global_best:   0%| |0/1000
Traceback (most recent call last):
File "C:/FILES/boates/work_local/_code/warping-pso-dbscan/optimize_eps_and_mp.py", line 63, in <module>
cost, pos = optimizer.optimize(optimize_eps_and_mp, iters=1000)
File "C:\FILES\boates\Anaconda\envs\warping_pso_dbscan\lib\site-packages\pyswarms\single\global_best.py", line 184, in optimize
self.swarm.current_cost = compute_objective_function(self.swarm, objective_func, pool=pool, **kwargs)
File "C:\FILES\boates\Anaconda\envs\warping_pso_dbscan\lib\site-packages\pyswarms\backend\operators.py", line 239, in compute_objective_function
return objective_func(swarm.position, **kwargs)
File "C:/FILES/boates/work_local/_code/warping-pso-dbscan/optimize_eps_and_mp.py", line 38, in optimize_eps_and_mp
clusterer.fit(distance_matrix)
File "C:\FILES\boates\Anaconda\envs\warping_pso_dbscan\lib\site-packages\sklearn\cluster\dbscan_.py", line 351, in fit
**self.get_params())
File "C:\FILES\boates\Anaconda\envs\warping_pso_dbscan\lib\site-packages\sklearn\cluster\dbscan_.py", line 139, in dbscan
if not eps > 0.0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
pyswarms.single.global_best: 0%| |0/1000
Particle swarm optimisation works on batches: given a batch of particles, the objective function must return a batch of costs.
Here is the interesting part of the error message:
[...]
File "C:\FILES\boates\Anaconda\envs\warping_pso_dbscan\lib\site-packages\sklearn\cluster\dbscan_.py", line 139, in dbscan
if not eps > 0.0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This is a very common NumPy error message. It appears when you try to use an array as a condition: as the message explains, what is the truth value of an array like [True, False]? You have to use functions like all() or any() to reduce the array to a single boolean value.
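A minimal reproduction of the error, with arbitrary values standing in for the swarm column:

import numpy as np
eps = np.array([0.5, 1.5])       # an array where a scalar is expected
# bool(eps > 0.0) is ambiguous because the comparison is elementwise:
# if not eps > 0.0:              # raises the ValueError above
print((eps > 0.0).all())         # True: reduced to a single boolean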
So, why does this happen? Because eps is not intended to be an array. From the documentation of the DBSCAN class, eps is an optional float and min_samples an optional integer; both are scalars. Here you pass them arrays:
clusterer = DBSCAN(eps=x[:, 0], min_samples=x[:, 1], metric="precomputed")
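For comparison, a valid construction takes one scalar pair at a time (0.5 and 5 here are just illustrative values):

from sklearn.cluster import DBSCAN
# scalar parameters, as DBSCAN expects
clusterer = DBSCAN(eps=0.5, min_samples=5, metric="precomputed")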
You asked why your code works with the rosenbrock_with_args function. That's because it only performs operations that handle arrays natively. You pass it a two-dimensional array x (the batch of particles) of shape (10, 2) (10 particles of dimension 2) and the scalars a, b and c. From these, it computes a one-dimensional array of shape (10,), the cost value for each particle.
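You can check this shape behaviour directly, using the rosenbrock_with_args definition from the question and random values in place of a real swarm:

import numpy as np
x = np.random.uniform(-10, 10, (10, 2))   # stand-in for a batch of 10 particles
f = rosenbrock_with_args(x, a=1, b=100)   # elementwise over the whole batch
print(f.shape)                            # (10,): one cost per particle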
Your new optimize_eps_and_mp function, however, tries to perform operations on the array that are not supported. In particular, you use one column of the array as the eps parameter of DBSCAN, which expects a scalar. To make it work, you should handle the batch yourself, instantiating one DBSCAN object per particle:
for row in x:
    clusterer = DBSCAN(eps=row[0], min_samples=row[1], [...])
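Each iteration then yields one cost, which you collect into an array of length n_particles; the complete example at the end of this answer does exactly that.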
You said that:

the pyswarms library is supposed to run it [the objective function] many times independently (for each particle in the swarm) and evaluate their results, and it does this somehow by distributing the function to multiple sets of inputs all at once.
pyswarms can actually parallelize your swarm execution with the n_processes argument of the optimize function. In that case, your function is called multiple times in different processes, but still with arrays as inputs.
In your case, with 10 particles, 2 dimensions and n_processes set to None (the default), your x input has shape (10, 2). If you set n_processes to 2, your x input will have shape (5, 2). Finally, if you set n_processes to 10, your x input will have shape (1, 2). In every case, you have to "un-roll" the particle swarm for the DBSCAN instantiation.
import pyswarms as ps

def foo(x):
    print(x.shape)
    return x[:, 0]

if __name__ == "__main__":
    options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9}
    max_bound = [1000, 10]
    min_bound = [1, 2]
    bounds = (min_bound, max_bound)
    n_particles = 10
    optimizer = ps.single.GlobalBestPSO(n_particles=n_particles, dimensions=2, options=options, bounds=bounds)
    for n_processes in [None, 1, 2, 10]:
        print("\nParallelizing on {} processes.".format(n_processes))
        optimizer.optimize(foo, iters=1, n_processes=n_processes)
Parallelizing on None processes.
(10, 2)
Parallelizing on 1 processes.
(10, 2)
Parallelizing on 2 processes.
(5, 2)
(5, 2)
Parallelizing on 10 processes.
(1, 2)
(1, 2)
(1, 2)
(1, 2)
(1, 2)
(1, 2)
(1, 2)
(1, 2)
(1, 2)
(1, 2)
So, here is a complete example of how you can use DBSCAN in your case.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def optimize_eps_and_mp(x):
    num_particles = x.shape[0]
    costs = np.zeros([num_particles])
    print("Particles swarm", x)

    for idx, particle in enumerate(x):
        print("Particle", particle)
        # one DBSCAN instance per particle, with scalar parameters
        # (min_samples must be an integer, hence the cast)
        clusterer = DBSCAN(eps=particle[0], min_samples=int(particle[1]), metric="precomputed")
        clusterer.fit(distance_matrix)
        clusters = pd.DataFrame.from_dict({index_to_gid[i[0]]: [i[1]] for i in enumerate(clusterer.labels_)},
                                          orient="index", columns=["cluster"])
        settlements_clustered = settlements.join(clusters)
        cluster_pops = settlements_clustered.loc[settlements_clustered["cluster"] >= 0].groupby(["cluster"]).sum()["pop_sum"].to_list()

        cost = 1  # Update this to compute the cost value of the current particle
        costs[idx] = cost

    return costs
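With one cost per particle collected in costs, the optimizer call from the question works unchanged:

cost, pos = optimizer.optimize(optimize_eps_and_mp, iters=1000)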