Tags: networking, network-programming, cluster-computing, mpi, openmpi

Cluster hangs/shows error while executing simple MPI program in C


I am trying to run a simple MPI program (element-wise addition of two arrays). It runs perfectly on my PC, but on the cluster it either hangs or fails with the error shown below. I am using Open MPI; the command used to launch the program is given in the Error section.

Network configuration of the cluster (master & node1)

            MASTER
eth0      Link encap:Ethernet  HWaddr 00:22:19:A4:52:74  
          inet addr:10.1.1.1  Bcast:10.1.255.255  Mask:255.255.0.0
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16914 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7183 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2050581 (1.9 MiB)  TX bytes:981632 (958.6 KiB)

eth1      Link encap:Ethernet  HWaddr 00:22:19:A4:52:76  
          inet addr:192.168.41.203  Bcast:192.168.41.255  Mask:255.255.255.0
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:701 errors:0 dropped:0 overruns:0 frame:0
          TX packets:228 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:75457 (73.6 KiB)  TX bytes:25295 (24.7 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:88362 errors:0 dropped:0 overruns:0 frame:0
          TX packets:88362 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:21529504 (20.5 MiB)  TX bytes:21529504 (20.5 MiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:A4:52:74  
          inet6 addr: fe80::222:19ff:fea4:5274/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17175 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7257 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2373869 (2.2 MiB)  TX bytes:1020320 (996.4 KiB)
          Interrupt:16 Memory:da000000-da012800 

peth1     Link encap:Ethernet  HWaddr 00:22:19:A4:52:76  
          inet6 addr: fe80::222:19ff:fea4:5276/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1112 errors:0 dropped:0 overruns:0 frame:0
          TX packets:302 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:168837 (164.8 KiB)  TX bytes:33241 (32.4 KiB)
          Interrupt:16 Memory:d6000000-d6012800 

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E3:80:BC  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
            
                NODE 1
eth0      Link encap:Ethernet  HWaddr 00:22:19:53:42:C6  
          inet addr:10.1.255.253  Bcast:10.1.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16559 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7299 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1898811 (1.8 MiB)  TX bytes:1056294 (1.0 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:25 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3114 (3.0 KiB)  TX bytes:3114 (3.0 KiB)

peth0     Link encap:Ethernet  HWaddr 00:22:19:53:42:C6  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16913 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7276 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2221627 (2.1 MiB)  TX bytes:1076708 (1.0 MiB)
          Interrupt:16 Memory:f8000000-f8012800 

virbr0    Link encap:Ethernet  HWaddr 52:54:00:E7:E5:FF  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Error

mpirun -machinefile machine -np 4 ./query
Error output:
[[22877,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)

Code

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define group MPI_COMM_WORLD
#define root  0
#define size  100

int main(int argc, char *argv[])
{
    int no_tasks, task_id, i;
    int arr1[size], arr2[size], local1[size], local2[size];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(group, &no_tasks);
    MPI_Comm_rank(group, &task_id);

    /* The root rank initialises both input arrays.
       Note: the code assumes that size is evenly divisible by no_tasks. */
    if (task_id == root)
    {
        for (i = 0; i < size; i++)
        {
            arr1[i] = arr2[i] = i;
        }
    }

    /* Distribute equal chunks of both arrays to every rank. */
    MPI_Scatter(arr1, size/no_tasks, MPI_INT, local1, size/no_tasks, MPI_INT, root, group);
    MPI_Scatter(arr2, size/no_tasks, MPI_INT, local2, size/no_tasks, MPI_INT, root, group);

    /* Each rank adds its local chunks element-wise. */
    for (i = 0; i < size/no_tasks; i++)
    {
        local1[i] += local2[i];
    }

    /* Collect the partial sums back on the root rank. */
    MPI_Gather(local1, size/no_tasks, MPI_INT, arr1, size/no_tasks, MPI_INT, root, group);

    if (task_id == root)
    {
        printf("The Array Sum Is\n");
        for (i = 0; i < size; i++)
        {
            printf("%d  ", arr1[i]);
        }
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
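
For reference, the program above would typically be compiled with the Open MPI wrapper compiler and launched with the same command shown in the Error section (a sketch; the source file name query.c is assumed from the executable name ./query):

    $ mpicc query.c -o query
    $ mpirun -machinefile machine -np 4 ./query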

Solution

  • Tell Open MPI not to use the virtual bridge interface virbr0 for sending messages over TCP/IP, or, better, tell it to use only eth0 for that purpose (an exclude-based variant is sketched at the end of this item):

    $ mpiexec --mca btl_tcp_if_include eth0 ...
    

    This comes from the greedy behaviour of Open MPI's tcp BTL component, which transmits messages over TCP/IP. To maximise bandwidth, it tries to use every network interface that is up on each node. Both nodes have virbr0 configured with the same address on the same subnet. Open MPI fails to recognise that the two addresses are identical, and since the subnets match, it assumes it should be able to talk over virbr0. So when process A tries to send a message to process B on the other node, it knows that B listens on port P and therefore tries to connect to 192.168.122.1:P. But that address actually belongs to the virbr0 interface on process A's own node, so the node ends up talking to itself on a non-existent port, hence the "connection refused" error.
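
    The exclude-based variant mentioned at the start of this item would look roughly like this (a sketch; note that setting btl_tcp_if_exclude overrides Open MPI's default exclude list, so the loopback interface should be listed explicitly as well):

        $ mpiexec --mca btl_tcp_if_exclude lo,virbr0 ...

    To avoid passing the option on every run, the same setting can be made persistent in the per-user MCA parameter file, assuming a default Open MPI installation that reads $HOME/.openmpi/mca-params.conf:

        btl_tcp_if_include = eth0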