So I'm using Hydra 1.1 and hydra-ax-sweeper==1.1.5 to manage my configuration and run some hyper-parameter optimization on the minerl environment. For this purpose, I load a lot of data into memory with multiprocessing (via PyTorch): memory peaks at around 50 GB while loading and drops to about 30 GB once the data is fully loaded.
On a normal run this is not a problem (my machine has 90+ GB of RAM); a single training run finishes without any issue.
However, when I run the same code with the -m option (and hydra/sweeper: ax in the config), the code stops after about 2-3 sweeper runs, getting stuck at the data-loading phase because all of the system's memory (plus swap) is occupied.
At first I thought this was an issue with the minerl environment code, which starts Java code in a subprocess. So I tried running my code without the environment (only the 30 GB of data), and I still hit the same issue. So I suspect there is a memory leak somewhere in between the Hydra sweeper runs.
So my question is: how does the Hydra sweeper (or ax-sweeper) work between sweeps? I always had the impression that it runs the main(cfg: DictConfig) function decorated with @hydra.main(...), takes the scalar return value (the score), and feeds it to the Bayesian optimizer, with main() being called like an ordinary function (everything inside it properly deallocated/garbage-collected between sweep runs).
Is this not the case? Should I instead load the data somewhere outside main() and keep it alive between sweeps?
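For reference, here is a minimal sketch of what my setup roughly looks like (simplified; load_dataset and train are placeholders standing in for my actual data-loading and training code, and the config layout is just an example):

```python
# Run as a sweep with:  python my_app.py -m
# with hydra/sweeper: ax selected via the defaults list in conf/config.yaml.
import hydra
from omegaconf import DictConfig


def load_dataset(cfg: DictConfig):
    # Placeholder for the real (multiprocessing) loading code, ~30 GB resident.
    return list(range(1000))


def train(cfg: DictConfig, data) -> float:
    # Placeholder for a single training run; returns the score to optimize.
    return float(len(data))


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> float:
    data = load_dataset(cfg)
    score = train(cfg, data)
    return score  # scalar return value consumed by the Ax sweeper


if __name__ == "__main__":
    main()
```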
Thank you very much in advance!
The hydra-ax-sweeper may run trials in parallel, depending on the result of calling the get_max_parallelism function defined in ax.service.ax_client.
I suspect that your machine is running out of memory because of this parallelism.
Hydra's Ax plugin does not currently have a config group for configuring this max_parallelism setting, so it is set automatically by Ax.
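If you want to see what parallelism Ax would choose, you can inspect it directly with the Ax service API, independent of Hydra. This is just an illustrative sketch (the parameter definitions and objective name are made up, and the create_experiment keywords match the Ax versions contemporary with hydra-ax-sweeper 1.1.x):

```python
from ax.service.ax_client import AxClient

# Illustrative only: inspect the parallelism schedule Ax picks by default.
# Substitute your own search space for the made-up "lr" parameter below.
ax_client = AxClient()
ax_client.create_experiment(
    name="parallelism_check",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-1], "log_scale": True},
    ],
    objective_name="score",
    minimize=False,
)
# Returns a schedule of (num_trials, max_parallelism) tuples; the initial
# (quasi-)random trials are typically allowed to run with higher parallelism.
print(ax_client.get_max_parallelism())
```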
Loading the data outside of main (as you suggested) may be a good workaround for this issue.
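If the sweep's trials execute sequentially in the same process (as with Hydra's default basic launcher), one way to do this is to cache the loaded data at module level so it is only loaded on the first trial and reused afterwards. A rough sketch, where load_dataset and train stand in for your actual code and cfg.data_path is a hypothetical config key:

```python
import functools

import hydra
from omegaconf import DictConfig


def load_dataset(path: str):
    # Stand-in for the real (multiprocessing) loading code.
    return list(range(1000))


def train(cfg: DictConfig, data) -> float:
    # Stand-in for a single training run; returns the score for the sweeper.
    return float(len(data))


@functools.lru_cache(maxsize=1)
def get_dataset(path: str):
    # Cached at module level: the large dataset is loaded only on the first
    # trial and reused by later trials running in the same process.
    return load_dataset(path)


@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> float:
    data = get_dataset(cfg.data_path)  # cfg.data_path is a hypothetical key
    return train(cfg, data)


if __name__ == "__main__":
    main()
```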