I'm having a problem where it appears that parallel (installed from yum via EPEL, not from the GNU parallel site) is not returning values from processes distributed to remote hosts, and I am not sure why.
The concurrent job I'm trying to run is similar to this simple example...
[myuser]$ parallel -q -j 5 \
--sshloginfile ./parallel-nodes.txt \
echo "Number {}: Running on `hostname`" ::: 1 2 3 4 5 6 7 8 9 10
Number 9: Running on HW04.co.local
Number 3: Running on HW04.co.local
Number 5: Running on HW04.co.local
Number 8: Running on HW04.co.local
Number 2: Running on HW04.co.local
Number 6: Running on HW04.co.local
^C^C^C^C%
This hangs until I Ctrl+C out (i.e. it only ever runs on the calling host). When not providing an sshloginfile, there is no problem...
[myuser]$ parallel -q -j 5 echo "Number {}: Running on `hostname`" ::: 1 2 3
Number 3: Running on HW04.co.local
Number 1: Running on HW04.co.local
Number 2: Running on HW04.co.local
Regarding the --sshloginfile setup: I have passwordless ssh enabled and can ssh without a password between all the nodes involved. parallel exists on all the nodes involved. The hosts in the sshloginfile are named the same as in the involved hosts' .ssh/known_hosts file.
When I ran this and saw it hanging, I tried examining the processes on each node that could be related to the parallel command...
[root@HW01 ~]# clush -ab "ps -aux | grep echo"
---------------
HW01
---------------
root 136318 0.0 0.0 294648 16468 pts/2 Sl+ 15:39 0:00 /usr/bin/python2 /usr/bin/clush -ab ps -aux | grep echo
root 136322 0.0 0.0 185096 4824 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW01 ps -aux | grep echo
root 136323 0.0 0.0 185096 4824 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW02 ps -aux | grep echo
root 136324 0.0 0.0 185096 4820 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW03 ps -aux | grep echo
root 136325 0.0 0.0 185096 4824 pts/2 S+ 15:39 0:00 ssh -oForwardAgent=no -oForwardX11=no -oConnectTimeout=15 -oBatchMode=yes HW04 ps -aux | grep echo
root 136334 0.0 0.0 113176 1584 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 136351 0.0 0.0 112712 968 ? S 15:39 0:00 grep echo
---------------
HW02
---------------
root 85835 0.0 0.0 113176 1580 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 85846 0.0 0.0 112708 968 ? S 15:39 0:00 grep echo
---------------
HW03
---------------
root 120282 0.0 0.0 113176 1576 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 120293 0.0 0.0 112708 968 ? S 15:39 0:00 grep echo
---------------
HW04
---------------
hph_etl 113600 1.5 0.0 157516 11944 pts/2 S+ 15:39 0:00 perl /bin/parallel -q -j 5 --sshloginfile /home/me/projects/myproject/parallel-nodes.txt echo Number {}: Running on HW04.co.local ::: 1 2 3 4 5 6 7 8 9 10
root 114154 0.0 0.0 113176 1584 ? Ss 15:39 0:00 bash -c ps -aux | grep echo
root 114168 0.0 0.0 112712 960 ? S 15:39 0:00 grep echo
So it seems as if the command is never communicated to the other nodes at all and just stays on the calling node (here HW04). Yet, checking whether firewalld is running on any of the hosts...
[root@HW01 ~]# clush -ab systemctl status firewalld
---------------
HW01
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
---------------
HW02
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
Jul 16 15:17:27 HW02.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:17:28 HW02.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:32 HW02.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:33 HW02.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW03
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
Jul 16 15:11:15 HW03.ucera.local systemd[1]: Starting firewalld - dynamic firewall daemon...
Jul 16 15:11:16 HW03.ucera.local systemd[1]: Started firewalld - dynamic firewall daemon.
Jul 17 16:05:46 HW03.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 17 16:05:47 HW03.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
---------------
HW04
---------------
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Thu 2019-07-25 15:00:33 HST; 4 days ago
Docs: man:firewalld(1)
Process: 3303 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 3303 (code=exited, status=0/SUCCESS)
Jul 25 15:00:32 HW04.ucera.local systemd[1]: Stopping firewalld - dynamic firewall daemon...
Jul 25 15:00:33 HW04.ucera.local systemd[1]: Stopped firewalld - dynamic firewall daemon.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
clush: HW[01-04] (4): exited with exit code 3
shows it to be inactive on all hosts.
At this point, not sure what is going wrong. Can anyone think of any debugging suggestions or fixes?
** Also, neither of the commands listed above worked when including the --bibtex option in the command. Does anyone know why that would happen?
In the example you link to, see how the backquotes are backslashed? You need to do that, or else hostname gets executed by your own shell on HW04 before parallel ever talks to the other machines. That is why every line of your output says HW04.
First off, I'd try this to see whether you are talking to those other machines at all:
parallel -j 5 \
--sshloginfile ./parallel-nodes.txt \
echo "Number {}: Running on \`hostname\`" ::: 1 2 3 4 5 6 7 8 9 10
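To see the quoting difference without involving parallel or ssh at all, here is a minimal sketch. It is only an illustration: the WHERE variable and the child `sh -c` are stand-ins I made up for "the calling host" and "the remote host's shell", not anything parallel itself uses.

```shell
#!/bin/sh
WHERE=caller   # hypothetical stand-in for the machine that launches parallel

# Unescaped backquotes: substituted right here, so the text is already fixed.
expanded="Running on `echo $WHERE`"

# Escaped backquotes (and $): kept as literal text for a later shell to evaluate.
literal="Running on \`echo \$WHERE\`"

# A child shell with a different WHERE stands in for the remote host's shell.
out_expanded=$(WHERE=remote sh -c "echo \"$expanded\"")
out_literal=$(WHERE=remote sh -c "echo \"$literal\"")

echo "unescaped: $out_expanded"
echo "escaped:   $out_literal"
```

The unescaped version reports the caller no matter where it is evaluated later, which matches the all-HW04 output in the question; the escaped version is resolved by whichever shell finally runs it.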
Then, I'd try tracking down your passwordless ssh setup one machine at a time to make sure it's really working; from HW04 try:
parallel -S HW01 'echo -n {} ""; hostname' ::: 1
parallel -S HW02 'echo -n {} ""; hostname' ::: 1
parallel -S HW03 'echo -n {} ""; hostname' ::: 1
parallel -S HW04 'echo -n {} ""; hostname' ::: 1
(repeat for every machine in your parallel-nodes.txt file)
If one of those machines isn't working with ssh, you can try to debug it with:
PARALLEL_SSH='ssh -v' parallel -S HW03 'echo -n {} ""; hostname' ::: 1
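If you have more than a handful of nodes, the per-machine check can be scripted with plain ssh, leaving parallel out of the picture entirely. This is a rough sketch under some assumptions: list_logins and check_logins are hypothetical helper names, the sshloginfile parsing is deliberately simplified (one login per line, optional leading "N/" job count, "#" comments), and SSH_CMD is an override I added only so the loop can be dry-run without touching the network.

```shell
#!/bin/sh
# Hypothetical helper: print one login per line from an sshloginfile,
# dropping comments, surrounding whitespace, blank lines, and any "N/" prefix.
list_logins() {
    sed -e 's/#.*//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' \
        -e 's|^[0-9][0-9]*/||' "$1"
}

# Hypothetical helper: try a non-interactive ssh to each login.
# BatchMode=yes makes ssh fail instead of waiting at a password prompt,
# which is one way a run can appear to hang.
check_logins() {
    for login in $(list_logins "$1"); do
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 \
               "$login" hostname </dev/null; then
            echo "OK   $login"
        else
            echo "FAIL $login"
        fi
    done
}
```

Run it from HW04 as the same user that runs parallel, e.g. `check_logins ./parallel-nodes.txt`; any line marked FAIL is a host to investigate with `ssh -v`.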