Tags: mpi, openmdao

Optimization Hangs When Using Nonlinear Solver While Running Under MPI


I am trying to solve optimization problems using gradient-free algorithms (such as the simple genetic algorithm) in OpenMDAO, using parallel function evaluation with MPI. When my problem has no cycles I do not run into any issues. However, as soon as I have to use a nonlinear solver to converge a cycle, the process hangs indefinitely after one of the ranks' nonlinear solvers finishes.

Here is a code example (solve_sellar.py):

import openmdao.api as om
from openmdao.test_suite.components.sellar_feature import SellarMDA
from openmdao.utils.mpi import MPI

# Determine this process's rank (0 when running without MPI)
if not MPI:
    rank = 0
else:
    rank = MPI.COMM_WORLD.rank


if __name__ == "__main__":
    prob = om.Problem()
    prob.model = SellarMDA()

    prob.model.add_design_var('x', lower=0, upper=10)
    prob.model.add_design_var('z', lower=0, upper=10)
    prob.model.add_objective('obj')
    prob.model.add_constraint('con1', upper=0)
    prob.model.add_constraint('con2', upper=0)

    # Use the built-in simple GA driver; evaluate the population in parallel when MPI is available
    prob.driver = om.SimpleGADriver(run_parallel=(MPI is not None), bits={"x": 32, "z": 32})

    prob.setup()
    prob.set_solver_print(level=0)

    prob.run_driver()

    if rank == 0:
        print('minimum found at')
        print(prob['x'][0])
        print(prob['z'])

        print('minimum objective')
        print(prob['obj'][0])

As you can see, this code is meant to solve the Sellar problem using the SimpleGADriver included in OpenMDAO. When I run this code in serial (python3 solve_sellar.py), it takes a while, but I get a result and the following output:

Unable to import mpi4py. Parallel processing unavailable.
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
<string>:1: RuntimeWarning: overflow encountered in exp
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
minimum found at
0.0
[0. 0.]
minimum objective
0.7779677271254263
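
Note that the convergence warnings themselves do not seem to be the real problem: the serial run still completes, and the messages just mean that the cycle group's NonlinearBlockGS solver hit its default cap of 10 iterations for some design points. If desired, that cap can presumably be raised after setup; a minimal sketch (50 is an arbitrary value I chose, not anything official):

# optional sketch: after prob.setup(), raise the iteration cap of the
# cycle group's Gauss-Seidel solver to reduce the convergence warnings
# (50 is an arbitrary choice)
prob.model.cycle.nonlinear_solver.options['maxiter'] = 50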

If I instead run this with MPI (mpirun -np 16 python3 solve_sellar.py) I get the following output:

NL: NLBJSolver 'NL: NLBJ' on system 'cycle' failed to converge in 10 iterations.

And then nothing more: the command hangs and blocks the assigned processors, but produces no further output. Eventually I kill the command with Ctrl-C. The process then continues to hang after the following output:

[mpiexec@eb26233a2dd8] Sending Ctrl-C to processes as requested
[mpiexec@eb26233a2dd8] Press Ctrl-C again to force abort

Hence, I have to force abort the process:

Ctrl-C caught... cleaning up processes
[proxy:0:0@eb26233a2dd8] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:0@eb26233a2dd8] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@eb26233a2dd8] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec@eb26233a2dd8] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@eb26233a2dd8] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@eb26233a2dd8] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@eb26233a2dd8] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

You should be able to reproduce this in any working MPI-enabled OpenMDAO environment, but I have also made a Dockerfile to ensure the environment is consistent:

FROM danieldv/hode:latest

RUN pip3 install --upgrade openmdao==2.9.0

ADD . /usr/src/app
WORKDIR /usr/src/app

CMD mpirun -np 16 python3 solve_sellar.py

Does anyone have a suggestion of how to solve this?


Solution

  • Thank you for reporting this. Yes, this looks like a bug that we introduced when we fixed the MPI norm calculation on some of the solvers.

    This bug has now been fixed as of commit c4369225f43e56133d5dd4238d1cdea07d76ecc3. You can get the fix by pulling the latest code from the OpenMDAO GitHub repo, or by waiting for the next release (which will be 2.9.2).
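
For reference, one way to pick up the fix before the 2.9.2 release is to install OpenMDAO directly from the GitHub repository, for example by swapping the pinned install in the Dockerfile above for one of the following (a sketch; pinning to the fixing commit is optional):

# install the latest OpenMDAO source instead of the pinned 2.9.0 release
RUN pip3 install --upgrade git+https://github.com/OpenMDAO/OpenMDAO.git
# or, pinned to the commit that contains the fix:
RUN pip3 install --upgrade git+https://github.com/OpenMDAO/OpenMDAO.git@c4369225f43e56133d5dd4238d1cdea07d76ecc3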