benchmarking cluster-computing mpi nas mpich

"p4_error : child process exited error"while running 256 threads of NAS benchmark on 32 node cluster


I'm trying to run a UPC-NAS Benchmark (compiled for 256 threads) on a cluster of 32 nodes. When I run it, rsh connections are established for 247 threads, and then it terminates with the following error:

p0_11350:  p4_error: Child process exited while making connection to remote process on dell16: 0
506 rm_l_237_24446: (26.785156) net_send: corm_11947: (215.339844) net_srm_l_1rm_24412: (26.785156) net_send: could not write to fd=4, errnrrrm_l_127_5013: (121.984375) net_send: could not w    rite to fd=5, errno = 32

Can anybody point out where the problem lies?

It runs fine with fewer threads, such as 64 or 128.


Solution

  • Errno 32 is EPIPE (#define EPIPE 32 /* Broken pipe */).

    I suspect that a file descriptor limit is being hit (check ulimit -a). It could also be a network limit or a network failure.

    I should also mention that p4 is ancient, so you may be running into an internal limit of p4 itself. Development of p4 stopped more than 15 years ago. It is "very stable" code in the Debian Stable sense: old and no longer changing.

    So why are you using MPICH1? Can you move to the less ancient MPICH2?
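
The two checks suggested in the answer (decoding the errno and inspecting per-process limits) can be sketched in a few lines of Python. This is only a minimal illustration, assuming a Linux node with Python 3 available; the helper names `describe_errno` and `limit_report` are hypothetical and not part of MPICH or p4. It covers roughly the "open files" and "max user processes" rows of `ulimit -a`.

```python
import errno
import os
import resource


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def limit_report() -> dict:
    """Collect (soft, hard) limits relevant to opening many connections."""
    wanted = (
        ("open files", resource.RLIMIT_NOFILE),
        ("max user processes", resource.RLIMIT_NPROC),
    )
    return {label: resource.getrlimit(res) for label, res in wanted}


def fmt(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    # errno 32 is the value reported by net_send in the log above.
    print(describe_errno(32))
    for label, (soft, hard) in limit_report().items():
        print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

The limits that matter are the ones seen by processes started through rsh on each node, which may differ from those of an interactive login shell, so it is worth running the check through the same launch path as the job.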