apache-spark, gremlin, graph-databases, spark-graphx

Gremlin traversal queries on a Spark graph


I have built a property graph (60 million nodes, 40 million edges) from S3 using the Apache Spark GraphX framework. I want to run traversal queries on this graph.
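
For illustration, such a graph can be built roughly like this (the S3 paths, file layout, and column names below are placeholders, not the actual schema):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("build-graph").getOrCreate()

// Placeholder layout -- vertices.csv: id,name,...   edges.csv: src,dst,label
val vertices = spark.read.option("header", "true").csv("s3a://bucket/vertices.csv")
  .rdd.map(r => (r.getAs[String]("id").toLong, r.getAs[String]("name")))
val edges = spark.read.option("header", "true").csv("s3a://bucket/edges.csv")
  .rdd.map(r => Edge(r.getAs[String]("src").toLong,
                     r.getAs[String]("dst").toLong,
                     r.getAs[String]("label")))

val graph: Graph[String, String] = Graph(vertices, edges)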

My queries will look like:

g.V().has("name","xyz").out('parent').out().has('name','abc')
g.V().has('proc_name','serv.exe').out('file_create').
has('file_path',containing('Tsk04.txt')).in().in('parent').values('proc_name')
g.V().has('md5','935ca12348040410e0b2a8215180474e').values('files')

Mostly, the queries are of the form g.V().out().out().out().

Such queries are easy to run on graph databases like Neo4j, Titan, and AWS Neptune, since they support Gremlin.

Can we traverse Spark graphs in such a manner? I tried the Spark Pregel API, but it is quite complex compared to Gremlin.

The reason I am looking at a Spark graph is that the cloud offerings of the above-mentioned graph databases are costly.


Solution

  • The Spark GraphFrames library should be the most convenient option for you. It provides Neo4j/Cypher-like traversal descriptions (motif finding) and uses the Spark DataFrames API for filtering:
    https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding Here is an example (a rough translation of one of your longer queries follows it):

    import org.apache.spark.sql.DataFrame
    import org.graphframes.GraphFrame

    val g2: GraphFrame = GraphFrame.fromGraphX(gx) // you can start with just V and E DataFrames here
    val motifs: DataFrame = g2.find("(a)-[e]->(b); (b)-[e2]->(c)")
    motifs.filter("a.name = 'xyz' and e.label = 'parent' and c.name = 'abc'").show()
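
    And a rough, untested sketch of how your second Gremlin query could be expressed as a motif, building on g2 above; the vertex columns proc_name and file_path and the edge column label are assumptions about your schema:

    import org.apache.spark.sql.functions.col

    // g.V().has('proc_name','serv.exe').out('file_create').
    //   has('file_path', containing('Tsk04.txt')).in().in('parent').values('proc_name')
    val chain = g2.find("(p)-[e1]->(f); (x)-[e2]->(f); (c)-[e3]->(x)")
    chain
      .filter(col("p.proc_name") === "serv.exe" && col("e1.label") === "file_create")
      .filter(col("f.file_path").contains("Tsk04.txt"))
      .filter(col("e3.label") === "parent")
      .select(col("c.proc_name"))
      .show()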
    

  • TinkerPop itself has Spark support, so you can issue Spark OLAP queries from the Gremlin Console: https://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer (a minimal sketch of this setup is included at the end of this answer).

  • Or there are some closed-source solutions. DataStax Enterprise has good Gremlin support for Spark: https://www.datastax.com/blog/2017/05/introducing-dse-graph-frames (I'm a former author of it).
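
As mentioned in the TinkerPop item above, here is a minimal, untested sketch of running one of the original traversals through SparkGraphComputer; the properties file path is the sample shipped with the TinkerPop distribution and will vary with your install and input format:

    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
    import org.apache.tinkerpop.gremlin.structure.util.GraphFactory

    // Open a HadoopGraph described by a properties file (input location/format, Spark master, etc.)
    val graph = GraphFactory.open("conf/hadoop/hadoop-gryo.properties")
    val g = graph.traversal().withComputer(classOf[SparkGraphComputer])

    // The traversal is executed as a Spark OLAP job
    val procs = g.V().has("name", "xyz").out("parent").out().has("name", "abc").toList()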