Tags: linux, ubuntu, mpi, oracle-cloud-infrastructure

Two MPI compute nodes cannot complete a TCP connection because of a firewall


I am trying to run a simple MPI example on two compute nodes, node1 and node2, which are virtual machines I just created on Oracle Cloud. (This is the first time I have used Oracle Cloud.) The system is Ubuntu 20.04. What I have done so far:

  • node1 and node2 have the correct MPI environment (OpenMPI-4.1.0) under the same path. $PATH and $LD_LIBRARY_PATH have also been set. I can successfully run the MPI example on a single node.
  • Passwordless login between node1 and node2 has been set up. I can use ssh node1 and ssh node2 to connect from one node to the other.
  • There is a hostfile (hosts2) on both nodes under the same path ($HOSTFILE_PATH/hosts2) containing:
node1  slots=1
node2  slots=1
  • The executable files (test) are under the same path ($EXE_PATH/test).

Then I ran $(which mpirun) -n 2 -hostfile $HOSTFILE_PATH/hosts2 $EXE_PATH/test. The command hung without printing anything, so I could only terminate it with Ctrl+C. After a few minutes, I got the following output:

 ------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    instance-1-632783
  Remote host:   instance-1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------

Is the problem related to the firewall? I tried sudo ufw status and got Status: inactive. I also tried sudo iptables -L, and got:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     udp  --  anywhere             anywhere             udp spt:ntp
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
InstanceServices  all  --  anywhere             link-local/16       

Chain InstanceServices (1 references)
target     prot opt source               destination         
ACCEPT     tcp  --  anywhere             169.254.0.2          owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.2.0/24       owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.4.0/24       owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.5.0/24       owner UID match root tcp dpt:iscsi-target /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.0.2          tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     udp  --  anywhere             169.254.169.254      udp dpt:domain /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.169.254      tcp dpt:domain /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.0.3          owner UID match root tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.0.4          tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     tcp  --  anywhere             169.254.169.254      tcp dpt:http /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     udp  --  anywhere             169.254.169.254      udp dpt:bootps /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     udp  --  anywhere             169.254.169.254      udp dpt:tftp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
ACCEPT     udp  --  anywhere             169.254.169.254      udp dpt:ntp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */
REJECT     tcp  --  anywhere             link-local/16        tcp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */ reject-with tcp-reset
REJECT     udp  --  anywhere             link-local/16        udp /* See the Oracle-Provided Images section in the Oracle Cloud Infrastructure documentation for security impact of modifying or removing this rule */ reject-with icmp-port-unreachable

Then I ran sudo iptables -F. After that, sudo iptables -L showed:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain InstanceServices (0 references)
target     prot opt source               destination       

But it seems that sudo iptables -F deletes the rules only temporarily. After I reboot the system, sudo iptables -L shows the former output again. So how can I solve the firewall problem? Should I delete the rules permanently? And if so, how?
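
As an illustration of the persistence question, here is a minimal sketch of allowing TCP from the other node instead of flushing every rule, and then saving the result. It rests on assumptions that the post does not confirm: that the nodes share the subnet 10.0.0.0/24 (a placeholder value) and that the boot-time rules are restored by netfilter-persistent from a saved file, as is common on Ubuntu. The run helper is hypothetical and only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch under stated assumptions; not a confirmed fix for this setup.
SUBNET_CIDR="${SUBNET_CIDR:-10.0.0.0/24}"   # assumption: subnet shared by node1 and node2
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

allow_subnet_tcp() {
  # -I INPUT inserts at the top of the chain, i.e. before the final REJECT rule.
  run sudo iptables -I INPUT -s "$SUBNET_CIDR" -p tcp -j ACCEPT
  # Assumes the netfilter-persistent package is installed; it saves the
  # running rules so that they survive a reboot.
  run sudo netfilter-persistent save
}

allow_subnet_tcp
```

With the default DRY_RUN=1 the script only echoes the two commands, so they can be reviewed before being applied on both nodes.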


Solution

  • Even if the VMs are in the same subnet, you still have to explicitly allow traffic between them.

    Open the required ports in the security list of the subnet you are using (https://docs.oracle.com/en-us/iaas/Content/Network/Concepts/securitylists.htm#Security_Lists).

    If you don't know which ports are needed, you can open all ports, although this is not good practice for production environments.
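
One way to avoid opening all ports is to confine Open MPI to a known port range and allow only that range between the nodes. The following is a minimal sketch, not a verified configuration: the range 50000-50099 is an arbitrary choice, and the MCA parameter names are the ones I would expect for Open MPI 4.1.x, so check them with ompi_info on your installation. The script only prints the resulting mpirun command.

```shell
#!/bin/sh
# Sketch: restrict Open MPI's TCP traffic to one port range.
PORT_MIN=50000                            # assumption: first port of the range
PORT_COUNT=100                            # assumption: number of ports in the range
PORT_MAX=$((PORT_MIN + PORT_COUNT - 1))   # last port of the range

# Print the MCA options for the data channel (btl_tcp) and the runtime's
# out-of-band channel (oob_tcp). Parameter names assumed for Open MPI 4.1.x.
mpi_port_args() {
  echo "--mca btl_tcp_port_min_v4 $PORT_MIN" \
       "--mca btl_tcp_port_range_v4 $PORT_COUNT" \
       "--mca oob_tcp_dynamic_ipv4_ports $PORT_MIN-$PORT_MAX"
}

# Dry run: show the command instead of executing it.
# $(mpi_port_args) is intentionally unquoted so the options split into words.
echo mpirun -n 2 -hostfile '$HOSTFILE_PATH/hosts2' $(mpi_port_args) '$EXE_PATH/test'
```

The same range would then be the one to allow as an ingress TCP rule (source: the subnet's CIDR) in the security list, rather than every port.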