I've created a genetic programming system in Python, but am having trouble with memory limits. The problem is storing all of the individuals in my population in memory. Currently, I store all individuals in memory, then reproduce the next generation's population, which also gets stored in memory. This means that I have two populations' worth of individuals loaded at once. After some testing, I've found that I exceed Windows' default 2 GB application memory limit fairly quickly.
Currently, I write out the entire population's individual trees to a file, which I can then load to recreate the population if I want. What I have been considering is, instead of keeping all of the individuals loaded in memory, pulling each individual from the file and instantiating only that single individual when I need it. From my understanding of Python's readline functionality, it should load only a single line from the file at a time, rather than the entire file. If I did this, I think I would be able to keep in memory only the individuals I am currently manipulating.
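Roughly, I was picturing something like the generator below (just a sketch; for the example I'm assuming one JSON-serialized individual per line, and evaluate() stands in for my own fitness code):

```python
import json

def iter_individuals(path):
    """Yield individuals one per line, so only one tree is in memory at a time.

    Assumes one JSON-serialized individual per line; the parsing would be
    swapped for however the trees are actually written out.
    """
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# e.g. score each individual without holding the whole population:
# for tree in iter_individuals("population.txt"):
#     fitness = evaluate(tree)   # evaluate() stands in for my fitness function
```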
My question is, is there an underlying problem with doing this that I'm not seeing right now? I understand that because I am dealing with data on disk instead of in memory, my performance is going to take a hit, but for this situation memory is more important than speed. Also, I don't want to increase the 2 GB of memory allotted to Python programs.
Thanks!
Given the RAM constraint, I'd change the population model from generational to steady state.
The idea is to iteratively breed a new child or two, assess their fitness and then reintroduce them directly into the population itself, killing off some preexisting individuals to make room for them.
Steady state uses half the memory of a traditional genetic algorithm because there is only one population at a time.
Changing the implementation shouldn't be too hard, but you have to pay attention to premature convergence (i.e. tweak parameters such as mutation rate, tournament size, ...).
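As a rough illustration, here is a minimal sketch of one steady-state step, assuming you supply your own `fitness` and `breed` functions (names invented for the example); parents are picked by tournament and the child overwrites the loser of another tournament:

```python
import random

def steady_state_step(population, fitness, breed, tournament_size=3):
    """One steady-state iteration: breed a child from two tournament winners
    and overwrite the loser of another tournament, so only one population
    (plus a single child) is ever held in memory."""
    def tournament(best=True):
        contenders = random.sample(range(len(population)), tournament_size)
        pick = max if best else min
        return pick(contenders, key=lambda i: fitness(population[i]))

    mum = population[tournament(best=True)]
    dad = population[tournament(best=True)]
    child = breed(mum, dad)                      # crossover and/or mutation

    population[tournament(best=False)] = child   # kill a weak individual to make room
    return population
```

In practice you would cache fitness values rather than recompute them in every tournament, but the overall shape of the loop is the same.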
The island model is another (or additional) possibility: the population is broken into separate sub-populations (demes). Demes send individuals to one another to help spread news of newly discovered fit areas of the space.
Usually it's an asynchronous mechanism, but you could use a synchronous algorithm, loading demes one by one, which greatly reduces the memory required.
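A minimal sketch of the synchronous variant, assuming each deme is pickled in its own file and `evolve_deme` is your own per-deme GP loop (both are assumptions for the example); migrants flow around the demes in a ring:

```python
import pickle
import random

def evolve_islands(deme_paths, evolve_deme, migrants_per_deme=2):
    """Load one deme at a time, evolve it, pass a few migrants on to the next
    deme, and write it back, so memory never holds more than one sub-population."""
    inbox = []                                    # migrants from the previous deme
    for path in deme_paths:
        with open(path, "rb") as f:
            deme = pickle.load(f)                 # only this deme is in memory

        deme.extend(inbox)                        # accept incoming migrants
        deme = evolve_deme(deme)                  # run your GP loop on this deme

        inbox = random.sample(deme, migrants_per_deme)
        with open(path, "wb") as f:
            pickle.dump(deme, f)
```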
Of course you can write the population to a file and load just the individuals you need. If you choose this approach, it's probably a good idea to compute a hash signature for each individual to speed up identification and loading.
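For example, you could build a small index mapping a hash of each serialized individual to its byte offset, then seek straight to the one you need (this sketch assumes one serialized individual per line, as in your current setup):

```python
import hashlib

def build_index(path):
    """Map a SHA-1 of each serialized individual to its byte offset in the file."""
    index, offset = {}, 0
    with open(path, "rb") as f:
        for line in f:
            index[hashlib.sha1(line.rstrip(b"\n")).hexdigest()] = offset
            offset += len(line)
    return index

def load_individual(path, index, digest):
    """Seek directly to one serialized individual instead of scanning the file."""
    with open(path, "rb") as f:
        f.seek(index[digest])
        return f.readline().rstrip(b"\n")
```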
In any case, you should consider that, depending on the task your GP system is performing, you could take a massive performance hit.