networkx graph-databases amazon-neptune network-analysis gremlinpython

Neptune-Gremlin-Python | Best practices for scaling network analysis and serving use cases like recommendations in real time


I have a general question about best practices for using Neptune as a graph database and about how well it scales for complex computation. I want to build a user recommendation system in which incoming users on the platform are shown other users they are likely to want to follow, in order to grow the network.

To implement a simple technique like triadic closure, should I run Gremlin queries against the graph database (AWS Neptune in my case) to generate the recommendations? I believe that in this case I would have to write Python scripts that parallelise the queries across multiple nodes and generate recommendations for each node at scale.

Or is it more common practice to store the network data (nodes, edges and their properties) in a relational database, run SQL queries to load it into Python, and then do the computation with packages like NetworkX? In this case I would not have to worry about batch computation, since a relational database like Redshift would take care of it. However, I would be writing the Python logic for techniques such as triadic closure myself.

Additionally, in the future I may want to use more complex graph computation techniques such as graph clustering, partitioning, and different kinds of centrality calculations. Are all or any of these possible within the Neptune + Gremlin framework?

With the above context, these are the questions I am seeking answers to:

  1. What is the tech stack commonly used by a data science team working with graph data to build solutions such as user recommendations? By data science tech stack I mean technologies that help query, analyse, visualise, compute and serve.

  2. Can Neptune + Gremlin replace Python packages such as NetworkX for network analysis and centrality measurement?

  3. Is Neptune DB ideal only as a data store, or can it also support complex network analysis and recommendation serving?

Any insight/resources on this would be really helpful!


Solution

  • It is definitely possible to do triadic closure in Gremlin. I have also seen data scientists use NetworkX and Gremlin together by running the gremlin-python client in a Jupyter notebook. As this question is quite specific to Amazon Neptune, you may want to post it to the Neptune support forum at [1]. There are also some useful Gremlin recipes at [2].

    If you post to the support forum I am sure someone will respond.
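
To make the triadic-closure idea from the question concrete, here is a minimal sketch in plain Python, independent of Neptune. It assumes the graph is available as a list of directed "follows" edges; the function names (`build_following`, `recommend`), the `top_n` parameter and the sample users are all hypothetical and chosen only for illustration. It ranks users two hops away by how many of your existing followees already follow them, which is one common way to read triadic closure for a follow graph, not necessarily the approach the asker has in mind.

```python
from collections import Counter, defaultdict


def build_following(edges):
    """Map each user to the set of users they follow."""
    following = defaultdict(set)
    for follower, followee in edges:
        following[follower].add(followee)
    return following


def recommend(following, user, top_n=3):
    """Rank two-hop candidates by the number of `user`'s followees that follow them."""
    already = following.get(user, set())
    counts = Counter()
    for friend in already:
        for candidate in following.get(friend, set()):
            # Skip the user themselves and anyone they already follow.
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    # Highest count first; ties broken by name so the output is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical sample data: (follower, followee) pairs.
edges = [
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "dave"), ("carol", "dave"),
    ("carol", "erin"), ("bob", "alice"),
]
following = build_following(edges)
print(recommend(following, "alice"))  # [('dave', 2), ('erin', 1)]
```

A Gremlin traversal for the same recommendation would follow the same shape: walk out two hops along the follow edges, exclude the starting user and the users already followed, then group and count the remaining candidates. Whether to run that per user inside the database or in bulk in Python depends on graph size and latency needs, which this sketch does not address.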