Search code examples
apache-sparkgraphframes

Spark graphFrames - Label Propagation vs. Strongly Connected Components


In the https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-scala.html standard example:

  • The Strongly Connected Components seem reasonable computationally when looking also at them visually on a drawing.

enter image description here

  • Therefore I am surprised at the Label Propagation for detecting "communities".

enter image description here

What am I missing? [A, D, E] I would have thought would be a community as well from the data and that results would be similar. I tried with more cycles. Label Propagation seems the poor cousin of "clustering".


Solution

  • What you have found here is a known phenomenon called label oscillation which occurs when the labels are synchronously updated and (sub)graphs have a bipartite structure (or star graphs). The two communities will endlessly exchange their labels and the LPA will never terminate by itself.

    Labels at t: oscillation at t

    Labels at t+1: oscillation at t+1

    Labels at t+2: oscillation at t

    ... and so on.

    As you have already mentioned this is not really what we would expect from a community detection algorithm as there are no edges within the communities. But this is still a fast algorithm which delivers good results for non bipartite structures. Raghavan proposed a fast alternative which uses asynchronous updating. But this is not yet implemented in Graphframes. Graphframes calls the graphX implementation of LPA (have a look at code code) which uses Pregel (have a look at code code) which is synchronous.

    Besides of LPA problem for bipartite structures there is also a general difference between SCC and LPA:

    • SCC community: Every node of a community knows (has an edge) all other nodes within the community.
    • LPA community: For each node of a community there is a path (sequence of edges) within the community to all other nodes within a community.