As an assignment I have to create a bunch of Docker containers and use pssh to show inter-connectivity of the cluster. I can manage 10 containers just fine, but when I start the process for 500 containers (which is the real assignment) I run into what seems like a randomly occurring "broken pipe". More importantly, it happens during the pssh inter-connectivity block of code. For ease of use I run the script in a VirtualBox VM running Ubuntu 22.04 (as opposed to my Windows host machine, because on Windows I can't connect to Docker containers through direct IP addressing).
I say the issue seems random because just now it happened on the first set of connections, but in an earlier attempt it happened after some 300 successful sets of connections. In my case a set of connections is one container using pssh to connect to the remaining 499 containers while running the hostname command (so altogether I run 500 * 499 connections, making this a lengthy process and the issue that much more annoying). The VM is slow, yes, but this takes in the ballpark of 7 hours to complete, so debugging is key and I just don't know what's causing the error.
This is the full error:
[20] 16:30:43 [FAILURE] 0 Exited with error code 255 Stderr: The authenticity of host 0.0.0.0 (0.0.0.0)
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option
Traceback (most recent call last):
  File "/usr/bin/pssh", line 119, in <module>
    do_pssh(hosts, cmdline, opts)
  File "/usr/bin/pssh", line 94, in do_pssh
    statuses = manager.run()
               ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/psshlib/manager.py", line 74, in run
    self.update_tasks(writer)
  File "/usr/lib/python3.12/site-packages/psshlib/manager.py", line 117, in update_tasks
    if self.reap_tasks() == 0:
       ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/psshlib/manager.py", line 139, in reap_tasks
    self.finished(task)
  File "/usr/lib/python3.12/site-packages/psshlib/manager.py", line 177, in finished
    task.report(n)
  File "/usr/lib/python3.12/site-packages/psshlib/task.py", line 291, in report
    sys.stdout.flush()
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
A few things to note: there is no host 0.0.0.0, and as far as I've seen there is no unmatched single quote that would cause issues for xargs. The error (I think) is fully caused by the broken pipe error.
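The xargs message can be reproduced in isolation. The following is a minimal sketch, not taken from the script: the sample line is an assumption modelled on the wording of ssh's unknown-host warning, which contains an apostrophe in "can't", and the address in it is made up. It only shows that default xargs rejects such a line, while xargs -0 accepts it.

```shell
# Assumed sample text, shaped like ssh's unknown-host warning (address is made up).
sample="The authenticity of host '10.0.0.5 (10.0.0.5)' can't be established."

# Default xargs treats quotes as special, so the lone apostrophe in "can't" is an error.
status=0
printf '%s\n' "$sample" | xargs >/dev/null 2>&1 || status=$?
echo "default xargs exit status: $status"

# With NUL-separated input and -0, quotes are taken literally.
out=$(printf '%s\n' "$sample" | tr '\n' '\0' | xargs -0 echo)
echo "xargs -0 output: $out"
```

If xargs exits this way while the program feeding it is still writing, the writer's next write to the pipe fails with EPIPE, which Python reports as BrokenPipeError. That would be consistent with the traceback above, but on its own it does not prove that this is what happened in the 500-container run.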
Here is the code in question. It is a bash script that automates the process of creating the Docker image, creating the containers, testing each connection to the first container and then using pssh to check all the inter-connections between containers. This is just a snippet of the whole script, as (I think, at least) the rest of the code is unrelated to the issue:
# execute pssh
echo -e "\n\033[92mStep 8: \033[92m\033[100mExecuting pssh to demonstrate all interconnections...\033[0m \033[33m"
for IP in $IPset; do
  echo -en "\033[91m $IP \033[0m => \033[33m"
  ssh -i keydir/my_key root@$IP pssh -i -x \"-i /root/.ssh/my_key\" -H \"$(echo $IPset | sed -e s/$IP//g)\" hostname -i | xargs
done #| cowsay -f cheese
A few notes for the code snippet: $IPset is a variable containing all of the container IPs from creation, and if I choose any single container and use ssh and pssh to connect to multiple containers, the result is without issue. I have run this code countless times on fewer containers (usually 10) and that result is perfectly fine. Lastly, the containers are running Alpine Linux, hence the use of the pssh command.
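One detail of the snippet that may be worth ruling out, shown here as a minimal sketch with made-up addresses: sed -e s/$IP//g removes the text of $IP wherever it occurs, including at the start of longer addresses, so leftovers such as 0 or 1 can end up in the host list once there are enough containers for addresses to share a prefix. Whether this is what produced the host 0 in the failure line is an assumption, not something confirmed here. The second half shows one possible alternative that compares whole entries instead.

```shell
# Hypothetical addresses; the real $IPset comes from container creation.
IPset="172.17.0.2 172.17.0.20 172.17.0.21 172.17.0.3"
IP="172.17.0.2"

# Substring removal, as in the snippet: it also strips the prefix of .20 and .21,
# leaving the stray entries "0" and "1" behind.
by_sed=$(echo $IPset | sed -e s/$IP//g)
echo "sed:    $by_sed"

# Whole-entry comparison: drops only the exact match.
by_match=""
for other in $IPset; do
  [ "$other" = "$IP" ] || by_match="$by_match $other"
done
by_match=${by_match# }
echo "filter: $by_match"
```

With only 10 containers no address is a prefix of another in a typical default Docker bridge range, which might explain why the small runs look fine, but that is again an assumption about the address layout.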
I've tried looking into the broken pipe error, but as Ubuntu isn't my primary OS and bash isn't my strong suit, I'm not sure which of the issues I found applies to my script. I've seen some sites mention tmux as a way of making sure the ssh connections stay open longer and the pipe for xargs doesn't close (I believe this is the pipe that causes the problem). However, the tmux documentation mostly describes how to run a terminal with multiple different processes, and I don't want to waste time on this only for it to turn out the issue was fixable without it.
Another important note: the script isn't mine. As I said, I don't use bash and I couldn't have come up with the script on my own. It is the creation of my professor, and if need be I can post the full script (if the pssh snippet proves to be too little context). I don't want to remove the | xargs pipe without someone more knowledgeable confirming that as a solution, as the issue doesn't show up on smaller scales and I don't want to wait 7 hours only to be slapped by another error near the finish. Any help is appreciated, and if anyone has different ideas as to what the cause might be, I'm open to them.
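To avoid repeating a full run just to see what failed, each set of connections could write to its own log file and failures could be collected at the end. This is only a sketch under stated assumptions: check_from is a hypothetical stand-in for the real ssh/pssh call, the addresses are made up, and one of them is made to fail on purpose.

```shell
# Hypothetical stand-in for the real "ssh ... pssh ..." call;
# it deliberately fails for one made-up address.
check_from() {
  [ "$1" != "10.0.0.3" ] && echo "ok from $1"
}

IPset="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4"
logdir=$(mktemp -d)
failed=""

for IP in $IPset; do
  # Keep stdout and stderr of every set in its own file.
  if ! check_from "$IP" >"$logdir/$IP.log" 2>&1; then
    failed="$failed $IP"
  fi
done
failed=${failed# }
echo "failed sets: ${failed:-none} (logs in $logdir)"
```

After a run, only the addresses listed as failed would need to be retried, and their log files show the exact output that preceded the error.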
EDIT: I tried removing the | xargs pipe on a smaller scale and it only affects the formatting of the output, so that isn't the issue. I will look more into tmux as I wait for people on SO to answer/comment on this post.
EDIT 2: I installed tmux after seeing more posts claiming it can fix the issue and started a new run for the 500 containers. I am doubtful a random tmux session will just fix this, so I will hold off on a verdict until the script is done.
EDIT 3: After about an hour of setting up the containers the script once again threw an error: [Errno 32] Broken pipe. tmux didn't solve the issue and now I am at a loss. This was the one big thing that could save me. Any help is appreciated.
It's been a few days, and since I'm no longer facing the issue I'll share what I did: I simply restarted the VM and the error was gone.