To set up my question, allow me to start with an example:
Suppose I have a set of 1000 arrays (i.e., row vectors), all the same length. Each is filled with random numbers between -1 and 1. I then pull 500 of those row vectors at random and sum them. Starting from that sum, I want to reverse-engineer the selection from the original 1000.
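To make the setup concrete, here is a minimal sketch of the problem generation (the vector length of 20 and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

n_vectors, length = 1000, 20  # 1000 row vectors as in the question; length is illustrative
vectors = rng.uniform(-1.0, 1.0, size=(n_vectors, length))

# Pull 500 of the row vectors at random and sum them.
chosen = rng.choice(n_vectors, size=500, replace=False)
target = vectors[chosen].sum(axis=0)

# The task: given only `vectors` and `target`, recover the selection,
# here represented as a 1000-bit mask.
mask = np.zeros(n_vectors, dtype=bool)
mask[chosen] = True
assert np.allclose(vectors[mask].sum(axis=0), target)
```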
I decide to solve this with a genetic algorithm. I start with a population of bit strings, each 1000 bits long, and run them through the mutation (i.e., flip-a-random-bit) and crossover procedures. After a tenth of a second, I'm 75% right. After another hour, I'm 76% right. Essentially, I'm stuck waiting forever for a few dozen bits to be set correctly. It may be that my random number generator never introduces those bits in a way that lets them be merged into the solution.
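The two operators I'm using look roughly like this (a toy sketch; the function names and the one-point crossover choice are mine):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

def mutate(bits):
    """Flip one randomly chosen bit (the 'flip-a-random-bit' step)."""
    i = random.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

def crossover(a, b):
    """One-point crossover: splice a prefix of one parent onto the other's suffix."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Flipping one bit of an all-zero string leaves exactly one bit set.
child = mutate([0] * 8)
assert sum(child) == 1
```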
The algorithm does very well at the beginning but then fails to improve the solution further. I tried making sure that my genetic population contained at least one member with each possible bit position set. That didn't help, since you can't control how quickly items die out of the pool.
It seems that there must be an additional component to the algorithm, something driving the selection of the flip-bit (i.e., mutation) locations. What is the technical term for this piece? Gradient? Where does it come from?
So essentially you have a number s and you want to find a subset of numbers that adds up to s. This is known as the subset sum problem, which is a special case of the knapsack problem.
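For tiny instances the subset sum framing can be checked by exhaustive search; this is a hypothetical sketch (the function name and tolerance are mine, and it uses scalars rather than the question's vectors):

```python
from itertools import combinations

def subset_with_sum(numbers, s, tol=1e-9):
    """Exhaustive search: return indices of a subset of `numbers` summing to `s`,
    or None. Exponential time; only feasible for very small inputs."""
    for r in range(len(numbers) + 1):
        for combo in combinations(range(len(numbers)), r):
            if abs(sum(numbers[i] for i in combo) - s) <= tol:
                return combo
    return None

assert subset_with_sum([3, 1, 4, 1, 5], 9) is not None  # e.g. 4 + 5
assert subset_with_sum([3, 1, 4], 100) is None
```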
I think your expectations regarding the application of genetic algorithms are wrong. To visualize the number of possible solutions that would have to be considered to select 500 items out of a set of 1000, consider the following number (the binomial coefficient "1000 choose 500"):
270288240945436569515614693625975275496152008446548287007392875106625428705522193898612483924502370165362606085021546104802209750050679917549894219699518475423665484263751733356162464079737887344364574161119497604571044985756287880514600994219426752366915856603136862602484428109296905863799821216320
Let me clarify first: genetic algorithms and related methods are metaheuristics, which means they are not designed to find the optimal solution, but rather a "good" solution in a very short time. To find the optimal solution to an NP-hard problem, you may have to try every single possible combination in the worst case. There are exact optimization methods that intelligently partition the search space and evaluate only a fraction of the solutions, but even these can take weeks to arrive at the optimal answer.
If you are required to find the exact optimum, I would suggest you look at exact methods such as branch and bound. You could, for example, use the well-known CPLEX optimizer and describe your problem as an integer program. Look at the ILP formulation of the TSP to see how this can be done, and translate that approach to your problem.
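To give a feel for the branch-and-bound idea without a solver dependency, here is a toy sketch on a scalar subset sum instance (not CPLEX; all names are mine, and in your problem the objective would be the distance between summed vectors rather than scalars):

```python
def branch_and_bound(numbers, target):
    """Toy branch and bound for subset sum: minimize |sum(subset) - target|.
    A branch is pruned when even its best possible completion cannot beat
    the incumbent solution."""
    n = len(numbers)
    # For each suffix, the most negative / most positive contribution
    # the remaining items can still add.
    lo = [0.0] * (n + 1)
    hi = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        lo[i] = lo[i + 1] + min(numbers[i], 0.0)
        hi[i] = hi[i + 1] + max(numbers[i], 0.0)

    best = {"err": abs(target), "subset": ()}  # incumbent: the empty subset

    def recurse(i, partial, chosen):
        err = abs(partial - target)
        if err < best["err"]:
            best["err"], best["subset"] = err, tuple(chosen)
        if i == n:
            return
        # Bound: closest any completion of this branch can get to the target.
        reach_lo, reach_hi = partial + lo[i], partial + hi[i]
        if reach_lo <= target <= reach_hi:
            gap = 0.0
        else:
            gap = min(abs(reach_lo - target), abs(reach_hi - target))
        if gap >= best["err"]:
            return  # prune: this branch cannot improve on the incumbent
        recurse(i + 1, partial + numbers[i], chosen + [i])  # include item i
        recurse(i + 1, partial, chosen)                     # exclude item i

    recurse(0, 0.0, [])
    return best["subset"], best["err"]

subset, err = branch_and_bound([0.5, -0.25, 0.75, 1.0], 1.25)
assert err == 0.0  # 0.5 + 0.75 hits the target exactly
```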
If you are not required to find the exact optimum, there are several things you can monitor in your genetic algorithm to improve its output: