Tags: scala, apache-spark, spark-graphx

Advice on how to use GraphX (use-case in the description below)


I have a dataset where each row has three fields: <src, dest, traceID>. Each row represents a single edge (from source to destination) and the ID of the trace it belongs to. Note that these traces are microservice invocations collected from an observability tool such as Jaeger, so there can be multiple traces (with different trace IDs) that contain the same edge connections. I want to achieve the following:

1. Parse each trace separately into a graph.
2. Group graphs that have the same structure.
3. Dump a representative graph from each group, together with the number of times that graph occurs in my dataset.

Note that I have 2 million such graphs (with an average of 15 nodes per graph). Is GraphX suitable for such a problem?

I am currently parsing this as an edge RDD, but I am not sure how to split it into one graph per trace. Should I create a separate graph object for each trace?
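For reference, the input looks roughly like this (values made up for illustration):

```
src     dest    traceID
svcA    svcB    trace-1
svcB    svcC    trace-1
svcA    svcB    trace-2
svcB    svcD    trace-2
```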


Solution

  • In my opinion, a lot of the functionality you need for this is not there in GraphX.

    To address similar problems in my work, I have developed a PySpark package called splink_graph, which can handle the tasks you're aiming for in a Spark cluster environment.

    First, let me outline how I would approach your problem:

    1. Get all the edges into an appropriate structure.
    2. Run connected components to determine the subgraphs that make up the disconnected graph formed by your set of edges (see the sketch after this list).
    3. Find a way to identify structurally identical graphs.
    4. Group by kind of graph and count.
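    As a minimal sketch of steps 1 and 2 in PySpark, using GraphFrames (discussed below) — the file name and column names are assumptions about your data:

    ```python
    from pyspark.sql import SparkSession, functions as F
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()
    # connectedComponents() requires a checkpoint directory to be set
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    # Assumed input: one edge per row, with columns src, dest, traceID
    edges = (spark.read.csv("traces.csv", header=True)
                  .withColumnRenamed("dest", "dst"))  # GraphFrames expects src/dst

    # Vertices are all distinct node ids appearing on either end of an edge
    vertices = (edges.select(F.col("src").alias("id"))
                     .union(edges.select(F.col("dst").alias("id")))
                     .distinct())

    g = GraphFrame(vertices, edges)
    components = g.connectedComponents()  # vertex DataFrame with an extra `component` column
    ```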

    While you could likely execute the first two steps with GraphX (see the connectedComponents operator in the GraphX docs), GraphX cannot handle the latter two out of the box.

    With splink_graph, you could:

    • Load the edge list into a DataFrame.
    • Run the connected components algorithm, leveraging GraphFrames for scalable operation.
    • Use the Weisfeiler-Lehman graph-hash functionality provided by splink_graph for quick graph-isomorphism testing, which is what "grouping graphs with the same structure" means in graph-theory terms.
    • Perform a groupBy("graphhash").count() to get counts for each kind of graph you have (see the sketch after this list).
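    I don't have the exact splink_graph call at hand, so as a stand-in here is a sketch of the hashing and counting steps using networkx's weisfeiler_lehman_graph_hash inside a plain Python UDF; it continues from the GraphFrames sketch above:

    ```python
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    import networkx as nx

    def wl_hash(edge_list):
        # Build one small graph per component and hash its structure.
        # Note: this treats edges as undirected and ignores node/edge labels.
        g = nx.Graph()
        g.add_edges_from((e["src"], e["dst"]) for e in edge_list)
        return nx.weisfeiler_lehman_graph_hash(g)

    wl_hash_udf = F.udf(wl_hash, StringType())

    # Attach the component id to every edge by joining on the source vertex
    edges_cc = (edges.join(components, edges["src"] == components["id"])
                     .select("src", "dst", "component"))

    # One row per subgraph: its edges collected into a list, then hashed.
    # If each traceID is already guaranteed to be one graph, you could
    # group by traceID here instead of component.
    hashed = (edges_cc
              .groupBy("component")
              .agg(F.collect_list(F.struct("src", "dst")).alias("edge_list"))
              .withColumn("graphhash", wl_hash_udf("edge_list")))

    # Frequency of each structurally identical kind of graph
    hashed.groupBy("graphhash").count().show()
    ```

    With roughly 2 million subgraphs of about 15 nodes each, the per-graph work inside the UDF is small, so this part should parallelise well across the cluster.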

    By following this approach, you should be able to accomplish what you're seeking to do.

    Of course, what I propose is Python/PySpark-based, not Scala. If that is an issue, I would suggest implementing the connected-components and Weisfeiler-Lehman graph-hash functionality in Scala/Spark yourself.