Spark graphFrames - Label Propagation vs. Strongly Connected Components

In the https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html standard example:

The Strongly Connected Components seem reasonable computationally when looking also at them visually on a drawing.

Therefore I am surprised at the Label Propagation for detecting "communities".

What am I missing? [A, D, E] I would have thought would be a community as well from the data and that results would be similar. I tried with more cycles. Label Propagation seems the poor cousin of "clustering".

Solution

What you have found here is a known phenomenon called label oscillation which occurs when the labels are synchronously updated and (sub)graphs have a bipartite structure (or star graphs). The two communities will endlessly exchange their labels and the LPA will never terminate by itself.

Labels at t:

Labels at t+1:

Labels at t+2:

... and so on.

As you have already mentioned this is not really what we would expect from a community detection algorithm as there are no edges within the communities. But this is still a fast algorithm which delivers good results for non bipartite structures. Raghavan proposed a fast alternative which uses asynchronous updating. But this is not yet implemented in Graphframes. Graphframes calls the graphX implementation of LPA (have a look at code code) which uses Pregel (have a look at code code) which is synchronous.

Besides of LPA problem for bipartite structures there is also a general difference between SCC and LPA:

SCC community: Every node of a community knows (has an edge) all other nodes within the community.
LPA community: For each node of a community there is a path (sequence of edges) within the community to all other nodes within a community.