
How does Hydra's sweeper, specifically the Ax sweeper, free and allocate memory?


I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration and run hyperparameter optimization on the MineRL environment. For this, I load a lot of data into memory with multiprocessing (via PyTorch): usage peaks at around 50 GB during loading and drops to about 30 GB once everything is loaded.

On a normal run this is not a problem (my machine has 90+ GB of RAM); a single training run finishes without any issue.

However, when I run the same code with the -m option (and hydra/sweeper: ax in the config), it stops after about 2-3 sweeper runs, stuck at the data loading phase, because all of the system's memory (plus swap) is occupied.

At first I thought this was an issue with the MineRL environment code, which starts Java code in a subprocess. So I ran my code without the environment (only the 30 GB of data), and I still have the same issue. So I suspect there is a memory leak somewhere between Hydra sweeper runs.

So my question is: how does the Hydra sweeper (or the Ax sweeper) work between sweeps? I always had the impression that it runs the main(cfg: DictConfig) function decorated with @hydra.main(...), takes its scalar return value (the score), and feeds that score to the Bayesian optimizer, with main() called like an ordinary function (so everything inside is properly deallocated/garbage collected between sweep runs).

Is this not the case? Should I then load the data somewhere outside the main() and keep it between sweeps?

Thank you very much in advance!


Solution

  • The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client. I suspect that your machine is running out of memory because of this parallelism.

    Hydra's Ax plugin does not currently have a config group for this max_parallelism setting, so it is set automatically by Ax.

    Loading the data outside of main (as you suggested) may be a good workaround for this issue.
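One way to sketch the "load the data outside of main" workaround is a module-level cached loader, so that repeated calls in the same process reuse one copy of the data. This is only an illustration under assumptions: `load_dataset`, `LOAD_CALLS`, the `data/train` path, and the simplified `main(learning_rate, ...)` signature are hypothetical, and in a real Hydra app main would take the cfg argument and carry the @hydra.main(...) decorator.

```python
import functools

# Counts how often the expensive load actually runs (for illustration only).
LOAD_CALLS = {"count": 0}


@functools.lru_cache(maxsize=1)
def load_dataset(path):
    """Hypothetical expensive loader; the result is cached per process."""
    LOAD_CALLS["count"] += 1
    # Placeholder for the real loading logic (e.g. reading files from `path`).
    return tuple(range(1000))


def main(learning_rate, data_path="data/train"):
    """Simplified stand-in for the function Hydra would call once per trial."""
    data = load_dataset(data_path)  # loaded on the first call, reused afterwards
    # Placeholder score; a real trial would train a model and return a metric.
    return learning_rate * len(data)


if __name__ == "__main__":
    for lr in (0.1, 0.01, 0.001):
        main(lr)
    print("dataset loads:", LOAD_CALLS["count"])
```

This only helps when the trials share one Python process. If trials are launched as separate processes, each process would still load its own copy, and the total memory would again scale with the number of trials running at the same time.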