GNU Parallel not returning output values across remote hosts


I'm having a problem where it appears that GNU Parallel (installed from EPEL via yum, not downloaded from the GNU Parallel site) is not returning values from processes distributed to remote hosts, and I am not sure why.

The concurrent job I'm trying to run is similar to this simple example...

[myuser]$  parallel -q -j 5 \
    --sshloginfile ./parallel-nodes.txt \
    echo "Number {}: Running on `hostname`" ::: 1 2 3 4 5 6 7 8 9 10
Number 9: Running on HW04.co.local
Number 3: Running on HW04.co.local
Number 5: Running on HW04.co.local
Number 8: Running on HW04.co.local
Number 2: Running on HW04.co.local
Number 6: Running on HW04.co.local
^C^C^C^C%  

This hangs until I press Ctrl+C (i.e. the jobs only ever run on the calling host). Without an --sshloginfile, there is no problem...

[myuser]$ parallel -q -j 5 echo "Number {}: Running on `hostname`" ::: 1 2 3 
Number 3: Running on HW04.co.local
Number 1: Running on HW04.co.local
Number 2: Running on HW04.co.local

  • I can confirm that all of the nodes in the --sshloginfile have passwordless ssh enabled, and that I can ssh without a password between all the nodes involved.
  • I can also confirm that GNU Parallel is installed on all of the nodes involved,
  • that the user calling parallel exists on all of the nodes involved,
  • and that all of the host FQDNs in the sshloginfile match the names in the involved hosts' .ssh/known_hosts files.
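
The passwordless-ssh claim in the first bullet can be checked non-interactively. A minimal sketch, assuming the short host names HW01 to HW04 used in this question; check_ssh is a hypothetical helper, not part of parallel. It is best run as the same user that invokes parallel, since ssh keys are set up per user:

```shell
# Hypothetical helper: succeed only if key-based login to "$1" works.
check_ssh() {
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout keeps an unreachable host from hanging the check.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true </dev/null; then
        echo "ok: $1"
    else
        echo "FAILED: $1"
        return 1
    fi
}

# Usage (host names assumed from this question):
#   for h in HW01 HW02 HW03 HW04; do check_ssh "$h"; done
```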

When I ran this and saw it hang, I examined the processes on each node that could be related to the parallel command...

[root@HW01 ~]# clush -ab "ps -aux | grep echo"
---------------
HW01
---------------
root     136318  0.0  0.0 294648 16468 pts/2    Sl+  15:39   0:00 /usr/bin/python2 /usr/bin/clush -ab ps -aux | grep echo
root     136322  0.0  0.0 185096  4824 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW01 ps -aux | grep echo
root     136323  0.0  0.0 185096  4824 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW02 ps -aux | grep echo
root     136324  0.0  0.0 185096  4820 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW03 ps -aux | grep echo
root     136325  0.0  0.0 185096  4824 pts/2    S+   15:39   0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW04 ps -aux | grep echo
root     136334  0.0  0.0 113176  1584 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root     136351  0.0  0.0 112712   968 ?        S    15:39   0:00 grep echo
---------------
HW02
---------------
root      85835  0.0  0.0 113176  1580 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root      85846  0.0  0.0 112708   968 ?        S    15:39   0:00 grep echo
---------------
HW03
---------------
root     120282  0.0  0.0 113176  1576 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root     120293  0.0  0.0 112708   968 ?        S    15:39   0:00 grep echo
---------------
HW04
---------------
hph_etl  113600  1.5  0.0 157516 11944 pts/2    S+   15:39   0:00 perl /bin/parallel -q -j 5 --sshloginfile /home/me/projects/myproject/parallel-nodes.txt echo Number {}: Running on HW04.co.local ::: 1 2 3 4 5 6 7 8 9 10
root     114154  0.0  0.0 113176  1584 ?        Ss   15:39   0:00 bash -c ps -aux | grep echo
root     114168  0.0  0.0 112712   960 ?        S    15:39   0:00 grep echo

So it seems as if the command is never communicated to the other nodes at all and just stays on the calling node (here HW04). Yet, checking if firewalld is running on any of the hosts...

[root@HW01 ~]# clush -ab systemctl status firewalld
---------------
HW01
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)
---------------
HW02
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

Jul 16 15:17:27 HW02.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:17:28 HW02.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:32 HW02.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:33 HW02.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW03
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)

Jul 16 15:11:15 HW03.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:11:16 HW03.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:46 HW03.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:47 HW03.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW04
---------------
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-07-25 15:00:33 HST; 4 days ago
     Docs: man:firewalld(1)
  Process: 3303 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 3303 (code=exited, status=0/SUCCESS)

Jul 25 15:00:32 HW04.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 25 15:00:33 HW04.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
clush: HW[01-04] (4): exited with exit code 3

shows it to be inactive on all hosts.

At this point, I'm not sure what is going wrong. Can anyone suggest debugging steps or fixes?

Also, neither of the commands listed above worked when I included the --bibtex option. Does anyone know why that would happen?


Solution

  • In the example you link to, notice how the backquotes are escaped with backslashes? You need to do that, or else hostname gets executed by your own shell on HW04 before parallel ever talks to the other machines.

    First off, I'd try this to see whether you are talking to those other machines at all:

    parallel -j 5 \
        --sshloginfile ./parallel-nodes.txt \
        echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4 5 6 7 8 9 10
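
    To see the difference without involving parallel or ssh at all, compare what the calling shell hands over in each case. A small sketch (the variable names are only for illustration):

    ```shell
    # Unescaped backquotes inside double quotes: the calling shell runs
    # hostname itself and stores the result.
    expanded="Running on `hostname`"

    # Escaped backquotes, or single quotes: the text is passed through
    # untouched, so it can still be evaluated later on the remote side.
    escaped="Running on \`hostname\`"
    single='Running on `hostname`'

    printf '%s\n' "$expanded" "$escaped" "$single"
    ```

    The first line printed already contains the local host name; the other two still contain the literal backquoted command.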
    

    Then, I'd try tracking down your passwordless ssh setup one machine at a time to make sure it's really working; from HW04 try:

    parallel -S HW01 'echo -n {} ""; hostname' ::: 1
    parallel -S HW02 'echo -n {} ""; hostname' ::: 1
    parallel -S HW03 'echo -n {} ""; hostname' ::: 1
    parallel -S HW04 'echo -n {} ""; hostname' ::: 1
    

    (repeat for every machine in your parallel-nodes.txt file)

    If one of those machines isn't working with ssh, you can try to debug it with:

    PARALLEL_SSH='ssh -v' parallel -S HW03 'echo -n {} ""; hostname' ::: 1
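
The one-host-at-a-time check can also be looped over the login file. This is a rough sketch rather than anything taken from the parallel documentation: list_sshlogins and check_each_host are hypothetical helper names, the 30-second limit is arbitrary, and it assumes coreutils timeout is installed so that one hanging host does not block the remaining checks.

```shell
# Hypothetical helper: print the entries of a login file, skipping
# blank lines and lines that start with '#'.
list_sshlogins() {
    grep -v -e '^[[:space:]]*$' -e '^#' "$1"
}

# Hypothetical helper: run the single-host test once per entry.
# Each entry is passed to -S unchanged.
check_each_host() {
    list_sshlogins "$1" | while IFS= read -r login; do
        printf '== %s ==\n' "$login"
        # </dev/null guards against ssh reading the rest of the host
        # list from the loop's stdin; timeout gives up after 30 s.
        timeout 30 parallel -S "$login" 'echo -n {} ""; hostname' ::: 1 </dev/null \
            || printf 'FAILED: %s\n' "$login" >&2
    done
}

# Usage:
#   check_each_host ./parallel-nodes.txt
```

Any entry reported as FAILED is a candidate for the ssh -v run shown above.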