asynchronous distributed-computing middleware

Why are Distributed Systems considered complex?

I'm just getting into the concept of a Distributed System and its advantages and disadvantages. The book I'm reading says that Distributed Systems are inherently complex, and it lists the following as potential reasons for that complexity:

Heterogeneity
Asynchronous communication
Partial failures

What I am struggling to understand is what these concepts actually encompass (e.g., what is a partial failure, and what causes one?) and how they are dealt with in modern systems. Does middleware successfully solve all three of these complexity issues within a system?

Solution

This question can be answered in many words, but I'll try to boil it down to essentials:

Heterogeneity is one of the main problems integration tries to solve. It is an inherent characteristic of most distributed systems, and it refers to the fact that, more often than not, when you have to integrate multiple systems, they will:

Be on different platforms, in different networks;
Differ in their capabilities in terms of integration;
Have disparities in data, even data referring to the same business domain;
Use and support different (sometimes even forgotten or unsupported) technologies and standards;
Have different owners (are controlled by different departments, companies).

All of the above add more and more complexity.
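
To make the "disparities in data" point concrete, here is a minimal sketch of the kind of translation code integration usually ends up needing. The two source systems (a CRM and a billing system), their field names, and the `Customer` model are all hypothetical and invented for illustration; the only point is that the same customer arrives in two different shapes and has to be mapped onto one agreed format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Customer:
    """Canonical shape the integration layer agrees on (hypothetical)."""
    customer_id: str
    full_name: str
    signup_date: date


def from_crm(record: dict) -> Customer:
    # Hypothetical CRM: numeric id, separate name fields, ISO date string.
    return Customer(
        customer_id=str(record["id"]),
        full_name=f'{record["first_name"]} {record["last_name"]}',
        signup_date=date.fromisoformat(record["created"]),
    )


def from_billing(record: dict) -> Customer:
    # Hypothetical billing system: prefixed id, "LAST, FIRST" name,
    # and a DD/MM/YYYY date string.
    last, first = [part.strip() for part in record["NAME"].split(",")]
    day, month, year = (int(p) for p in record["SIGNUP"].split("/"))
    return Customer(
        customer_id=record["CUST_NO"].removeprefix("C-"),
        full_name=f"{first.title()} {last.title()}",
        signup_date=date(year, month, day),
    )


crm_view = from_crm(
    {"id": 42, "first_name": "Ada", "last_name": "Lovelace", "created": "2020-01-31"}
)
billing_view = from_billing(
    {"CUST_NO": "C-42", "NAME": "LOVELACE, ADA", "SIGNUP": "31/01/2020"}
)
```

Every additional system means another adapter like these, and every disagreement between systems (which date is the real signup date?) becomes a decision somebody has to make and maintain.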

Asynchronous communication solves some problems of stateless communication but introduces a whole other set of complexities that can easily lead to problems when the implementation is not done properly. This is mainly because you are only guaranteed that the message will be successfully received on the other end; you have no guarantee whatsoever about when the operation will be processed, if ever. That makes it much harder to orchestrate interdependent asynchronous tasks than synchronous ones.
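
The gap between "received" and "processed" can be sketched with Python's `asyncio`. This is only an illustration under made-up assumptions: the in-process queue stands in for a message broker, and the `worker` and `send` functions, the message names, and the delays are all hypothetical. The sender learns that its message was accepted as soon as it is queued, but it can only find out whether the work was done by waiting, and it has to decide for itself how long to wait.

```python
import asyncio
from typing import List


async def worker(queue: asyncio.Queue) -> None:
    # Hypothetical consumer: handles messages one at a time, at its own pace.
    while True:
        payload, done = await queue.get()
        await asyncio.sleep(payload["delay"])  # simulated processing time
        if not done.cancelled():  # the sender may have given up already
            done.set_result(f"processed {payload['name']}")
        queue.task_done()


async def send(queue: asyncio.Queue, name: str, delay: float, timeout: float) -> str:
    done = asyncio.get_running_loop().create_future()
    # After this line the message is "received", and nothing more is known.
    await queue.put(({"name": name, "delay": delay}, done))
    try:
        return await asyncio.wait_for(done, timeout)
    except asyncio.TimeoutError:
        return f"{name}: accepted, but no result within {timeout}s"


async def main() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(worker(queue))
    results = await asyncio.gather(
        send(queue, "fast", delay=0.01, timeout=0.5),
        send(queue, "slow", delay=0.3, timeout=0.05),
    )
    consumer.cancel()
    return list(results)


if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

Note that when the "slow" sender times out, it cannot tell whether the task is merely late, lost, or failed. Any task that depends on its outcome inherits that uncertainty, which is what makes orchestration hard.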

Partial failures - When you have processes that involve multiple interdependent write operations, you need ACID transactions to ensure that they either all succeed or all fail; otherwise, a failure halfway through leaves some writes applied and others not. Doing this when multiple systems are involved is even harder, because you cannot establish a common transactional context in a heterogeneous distributed environment as easily as you could within the boundaries of a single system. Often you will need to implement opposite (compensating) operations in your services, or worse, implement two-phase commit, just to be able to undo all prior writes in the process in case something goes wrong with one of the tasks.
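
The "opposite operations" approach can be sketched as follows. This is a simplified illustration, not a production pattern: the three "services" are hypothetical in-memory stand-ins, and the function names are made up. Each step is paired with an operation that undoes it, and when a step fails, the compensations for the steps that already succeeded are run in reverse order.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_with_compensation(steps: List[Step]) -> bool:
    """Run the actions in order; if one fails, undo the completed ones."""
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True


# Hypothetical in-memory stand-ins for three separate services.
stock = {"widget": 5}
balance = {"alice": 100}


def reserve_stock() -> None:
    stock["widget"] -= 1


def release_stock() -> None:
    stock["widget"] += 1


def charge() -> None:
    balance["alice"] -= 30


def refund() -> None:
    balance["alice"] += 30


def book_shipping() -> None:
    raise RuntimeError("shipping service unavailable")  # simulated failure


def cancel_shipping() -> None:
    pass  # never reached here: the booking did not succeed


order_steps: List[Step] = [
    (reserve_stock, release_stock),
    (charge, refund),
    (book_shipping, cancel_shipping),
]
ok = run_with_compensation(order_steps)
```

Here the shipping step fails, so the charge is refunded and the stock is released, and both dictionaries end up with their original values. The sketch deliberately leaves out the hard parts: a compensation can itself fail, the coordinator can crash between steps, and other clients can observe the intermediate state before it is undone.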

Hope this clears things up a bit!