
SimGrid. Asynchronous communications and failing links


The simulation has one master and seven workers. When a worker finishes processing its data, it uses dsend to send a message task to the master to report that execution is complete.

 getHost().setProperty("busy", "no");
 ReleaseTask releaseTask = new ReleaseTask(getHost().getName());
 releaseTask.dsend("Master");

The link that connects worker1 to the master is set to fail. Here is its link1.fail file:

PERIODICITY 2
0 1
1 0

I expected that only one releaseTask (the one from worker1) would fail to reach the master. Unfortunately, no releaseTask from the other workers reaches the master either. This error appears:

[13.059397] /builds/workspace/SimGrid-Multi/build_mode/Debug/node/simgrid-ubuntu-trusty-64/build/SimGrid-3.13/src/simix/smx_global.cpp:554: [simix_kernel/CRITICAL] Oops ! Deadlock or code not perfectly clean.
[13.059397] [simix_kernel/INFO] 16 processes are still running, waiting for something.

The master receives tasks this way:

Task listenTask = Task.receive("Master");

When the link that connects worker1 and the master is not broken, the whole simulation works fine.

How can I avoid this problem?

UPDATED

My platform.xml file:

<link id="0_11" state_file="linkfailures/0_11.fail" bandwidth="3.430125Bps" latency="4.669142ms"/>

0_11.fail file:

PERIODICITY 2
0 1
1 0

The worker starts to dsend a MessageTask to the master at 6.94 s. The MessageTask transmission takes 0.07 s, but at 7.00 s the link that connects the master and the worker goes down. I guess the master keeps "receiving" data forever and the error occurs. How can I handle this?


Solution

  • If you send your data with dsend, it only means that you don't care whether the receiver gets it or whether an error occurs. It does not make the communication more robust (nor less robust).

    You updated your question, giving two possible outcomes for your simulation. Sometimes you say that no communication makes it to the master and that the simulation ends when SimGrid reports a deadlock (16 processes are still running, waiting for something), and sometimes you report that a TransferFailureException occurs. If I'm right, that is exactly what is expected in your case.

    Here is what happens:

    • you send a message with dsend
    • the message gets lost because the link fails. No, it does not take forever to deliver because the link fails; it just disappears immediately.

    At this point there are two possible outcomes, depending on whether the link fails before or after the communication starts (before or after the receiver posts its recv).

    • If the link fails before the receiver (the master in your case, it seems) posts its recv request, then the failure will not be noticed. Indeed, there is no receiver yet to inform, and the sender said that it does not care about the communication outcome by using dsend.
    • If the link fails after the receiver posts its request, then the sender does not notice anything (because of the dsend), and the receiver gets a TransferFailureException on its receive action. So the failing communication does kill someone even though you sent it with dsend, but it is the master that dies. That is why the other workers cannot communicate with the master: it got an uncaught exception while receiving something from the faulty host.
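The two cases above can be written down as a toy decision function. This is only a sketch of the rule as stated in this answer, not SimGrid's actual network model: the class, the enum and the parameter names are all hypothetical, and the transfer is assumed to start once both sides have posted their request. The numbers in main are the ones from the question (send posted at 6.94 s, 0.07 s transfer, link down at 7.00 s), with a master assumed to be already waiting.

```java
// Toy model of the dsend outcomes described above. NOT SimGrid code:
// every name here is hypothetical and exists only for illustration.
class DsendOutcomeSketch {
    enum Outcome { DELIVERED, SILENTLY_LOST, RECEIVER_GETS_EXCEPTION }

    static Outcome outcome(double sendPosted, double recvPosted,
                           double duration, double linkFails) {
        // Assumption: the transfer starts once both sides have posted.
        double start = Math.max(sendPosted, recvPosted);
        if (linkFails >= start + duration) {
            return Outcome.DELIVERED;            // transfer finished before the failure
        }
        if (linkFails < recvPosted) {
            return Outcome.SILENTLY_LOST;        // nobody was listening yet, dsend does not care
        }
        return Outcome.RECEIVER_GETS_EXCEPTION;  // the waiting receiver is notified
    }

    public static void main(String[] args) {
        // Values from the question; the master is assumed to wait since t = 0.
        System.out.println(outcome(6.94, 0.0, 0.07, 7.00));
    }
}
```

With the values from the question, the transfer would end at about 7.01 s, after the link goes down at 7.00 s, so this toy model lands in the second case: the master is the one that gets the exception.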

    If you want the sender to notice that your message did not go through (to resend it, maybe), then you don't want to use dsend but isend (for an asynchronous communication) or send (for a blocking communication). The sender then has to pay attention to the status of the communication.
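A minimal sketch of the resend idea, written so that it runs without the SimGrid jar. The nested TransferFailureException is a local stand-in that only borrows the name of the MSG exception, and the Channel interface is a hypothetical replacement for a blocking send to the "Master" mailbox. In a real worker, the call inside the try block would be the blocking send on the task itself.

```java
// Stand-alone sketch: none of these types come from SimGrid.
class ResendSketch {
    // Stand-in for the MSG exception of the same name.
    static class TransferFailureException extends Exception {
        TransferFailureException(String msg) { super(msg); }
    }

    // Hypothetical channel playing the role of a blocking send to "Master".
    interface Channel {
        void send(String payload) throws TransferFailureException;
    }

    // Tries up to maxAttempts times; returns the attempt that succeeded, or -1.
    static int sendWithRetry(Channel channel, String payload, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                channel.send(payload);
                return attempt;
            } catch (TransferFailureException e) {
                System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
                // A simulated worker would typically wait here before retrying.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake link that is down for the first two attempts, then up.
        int[] calls = {0};
        Channel flaky = payload -> {
            if (++calls[0] <= 2) {
                throw new TransferFailureException("link is down");
            }
        };
        int used = sendWithRetry(flaky, "release:worker1", 5);
        System.out.println("Delivered after " + used + " attempt(s)");
    }
}
```

Bounding the number of attempts is a design choice of this sketch; without a bound, a worker behind a permanently dead link would retry forever.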

    If you want your message to be delayed rather than destroyed, then try changing the bandwidth of the link to 0 for a while (using availability_file instead of state_file).

    If you want your receiver to survive such a communication issue, just catch the exception it gets.
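A minimal sketch of a master that survives a failed transfer. To stay runnable without the SimGrid jar it relies on stand-ins: the nested TransferFailureException only mirrors the name of the MSG exception, and FakeMailbox.receive() is a hypothetical replacement for Task.receive("Master"). In the real master, the same try/catch would wrap the Task.receive("Master") call from the question.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stand-alone sketch: none of these types come from SimGrid.
class MasterReceiveSketch {
    // Stand-in for the MSG exception of the same name.
    static class TransferFailureException extends Exception {
        TransferFailureException(String msg) { super(msg); }
    }

    // Hypothetical mailbox: each entry is a task name or a failure marker.
    static class FakeMailbox {
        static final String FAILED = "<link failure>";
        private final Deque<String> pending = new ArrayDeque<>();

        void post(String entry) { pending.add(entry); }
        boolean isEmpty() { return pending.isEmpty(); }

        // Plays the role of Task.receive("Master"): returns a task or throws.
        String receive() throws TransferFailureException {
            String entry = pending.remove();
            if (FAILED.equals(entry)) {
                throw new TransferFailureException("link went down");
            }
            return entry;
        }
    }

    // Drains the mailbox; a failed transfer is logged and skipped, not fatal.
    static List<String> receiveAll(FakeMailbox mailbox) {
        List<String> received = new ArrayList<>();
        while (!mailbox.isEmpty()) {
            try {
                received.add(mailbox.receive());
            } catch (TransferFailureException e) {
                System.out.println("Transfer failed, still listening: " + e.getMessage());
            }
        }
        return received;
    }

    public static void main(String[] args) {
        FakeMailbox mailbox = new FakeMailbox();
        mailbox.post("release:worker2");
        mailbox.post(FakeMailbox.FAILED);   // the message from worker1 is lost
        mailbox.post("release:worker3");
        System.out.println(receiveAll(mailbox));
    }
}
```

The point of the loop is that the catch block sits inside it, so one lost message costs one iteration and the master keeps serving the other workers.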