My simulation has one master and seven workers. When a worker finishes processing its data, it uses dsend to send a message task to the master, announcing that execution is complete:
getHost().setProperty("busy", "no");
ReleaseTask releaseTask = new ReleaseTask(getHost().getName());
releaseTask.dsend("Master");
The link that connects worker1 and master is broken. This is its link1.fail file:
PERIODICITY 2
0 1
1 0
I expected that only one releaseTask (the one from worker1) would fail to reach master. Unfortunately, none of the releaseTasks from the other workers reach master either. This error appears:
[13.059397] /builds/workspace/SimGrid-Multi/build_mode/Debug/node/simgrid-ubuntu-trusty-64/build/SimGrid-3.13/src/simix/smx_global.cpp:554: [simix_kernel/CRITICAL] Oops ! Deadlock or code not perfectly clean.
[13.059397] [simix_kernel/INFO] 16 processes are still running, waiting for something.
The master receives tasks this way:
Task listenTask = Task.receive("Master");
When the link that connects worker1 and master isn't broken, the whole simulation works fine.
How can I avoid this problem?
UPDATED
My platform.xml file:
<link id="0_11" state_file="linkfailures/0_11.fail" bandwidth="3.430125Bps" latency="4.669142ms"/>
The 0_11.fail file:
PERIODICITY 2
0 1
1 0
The worker starts to dsend a MessageTask to master at 6.94 s. The MessageTask transmission time is 0.07 s, but at 7.00 s the link that connects master and worker goes down. I guess master then keeps "receiving" data forever and the error occurs. How should I handle this?
If you send your data with dsend, it only means that you don't care whether the receiver gets it or whether an error occurs. It does not make the communication more robust (nor less robust).
You updated your question, giving two possible outcomes for your simulation. Sometimes you say that no communication makes it to master and that the simulation ends when SimGrid reports a deadlock (16 processes are still running, waiting for something), and sometimes you report that a TransferFailureError is occurring. But actually, that's exactly what is expected in your case, if I'm right.
Here is what happens:
The worker sends its task with dsend, and the link fails around the time of that send.
At this point there are two possible outcomes, depending on whether the link fails before or after the communication starts (before or after the receiver posts its recv).
If the link fails before the receiver posts its recv request, then the failure will not be noticed. Indeed, there is no receiver yet to inform, and the sender said that it does not care about the communication outcome by using dsend.
If the link fails after the communication has started, the sender is not informed (because of the dsend), and the receiver gets a TransferFailureException on its receive action. So the failing communication does kill someone even though you sent it with dsend, but it is the master that dies. That is why the other workers cannot communicate with the master: it got an uncaught exception while receiving something from the host on the failed link.
If you want the sender to notice that your message did not go through (to resend it, maybe), then you don't want to use dsend but isend (for an asynchronous communication) or send (for a blocking communication). And the sender has to pay attention to the status of the communication.
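To illustrate the sender side, here is a minimal, self-contained sketch of a retry loop around a blocking send. It does not use SimGrid at all: Channel, the nested TransferFailureException, and sendWithRetry are hypothetical stand-ins made up for this example, so that it runs on its own. In your worker, the call to channel.send would be your task's blocking send to "Master", and the exception would be the one SimGrid raises.

```java
class RetryingSender {
    /** Local stand-in for the exception a failed transfer raises (not the SimGrid class). */
    static class TransferFailureException extends Exception {
        TransferFailureException(String message) { super(message); }
    }

    /** Hypothetical stand-in for a blocking send towards the master's mailbox. */
    interface Channel {
        void send(String payload) throws TransferFailureException;
    }

    /**
     * Tries to send the payload up to maxAttempts times.
     * Returns the attempt number that succeeded, or -1 if every attempt failed.
     */
    static int sendWithRetry(Channel channel, String payload, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                channel.send(payload); // blocking: returns only once the transfer is done
                return attempt;
            } catch (TransferFailureException e) {
                // The link failed during the transfer. Unlike with dsend, the sender
                // learns about it here and can decide to try again.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake channel whose link is down for the first two attempts.
        int[] failuresLeft = {2};
        Channel flaky = payload -> {
            if (failuresLeft[0] > 0) {
                failuresLeft[0]--;
                throw new TransferFailureException("link down");
            }
        };
        int attempts = sendWithRetry(flaky, "release-worker1", 5);
        System.out.println("delivered after " + attempts + " attempt(s)");
    }
}
```

In a real simulation the worker would probably also wait a little between attempts, otherwise it may retry several times while the link is still down.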
If you want your message to be delayed rather than destroyed, try changing the bandwidth of the link to 0 for a while (using availability_file instead of state_file).
If you want your receiver to survive such a communication issue, just catch the exception it gets.
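As a rough sketch of that last point, here is a self-contained receive loop that keeps listening after a failed transfer. It is only an illustration: FakeMailbox and the nested TransferFailureException are hypothetical stand-ins written for this example, not SimGrid classes. In your master, mailbox.receive() would be your Task.receive("Master") call, and the catch clause would name the exception SimGrid throws.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SurvivingMaster {
    /** Local stand-in for the exception a failed transfer raises (not the SimGrid class). */
    static class TransferFailureException extends Exception {
        TransferFailureException(String message) { super(message); }
    }

    /** Hypothetical mailbox stub: a null entry models a transfer that fails mid-way. */
    static class FakeMailbox {
        private final Iterator<String> events;

        FakeMailbox(List<String> events) { this.events = events.iterator(); }

        boolean hasPending() { return events.hasNext(); }

        String receive() throws TransferFailureException {
            String task = events.next();
            if (task == null) {
                throw new TransferFailureException("link failed during transfer");
            }
            return task;
        }
    }

    /** Receives until the mailbox is drained; returns how many tasks arrived intact. */
    static int receiveAll(FakeMailbox mailbox) {
        int received = 0;
        while (mailbox.hasPending()) {
            try {
                mailbox.receive();
                received++;
            } catch (TransferFailureException e) {
                // This one message is lost, but the master stays alive
                // and goes back to listening for the other workers.
            }
        }
        return received;
    }

    public static void main(String[] args) {
        FakeMailbox mailbox = new FakeMailbox(
                Arrays.asList("release-worker2", null, "release-worker3"));
        System.out.println(receiveAll(mailbox) + " task(s) received");
    }
}
```

The point is only where the try/catch sits: inside the loop, around the single receive, so that one broken link costs one message and not the whole master.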