Tags: titan, gremlin, tinkerpop, giraph, spark-graphx

Gremlin, Giraph, or GraphX on TitanDB?


I need some help confirming my choice, and any information you can give me would be welcome. My storage database is TitanDB with Cassandra. I have a very large graph. My goal is to use MLlib on the graph later.

My first idea was to use Titan with GraphX, but I did not find anything, either released or in development. TinkerPop is not ready yet. So I had a look at Giraph. As for TinkerPop, Titan can communicate with Rexster from TinkerPop.

My question is: what are the benefits of using Giraph? Gremlin seems to do the same thing and is distributed.

Thank you very much for explaining. I think I don't really understand the difference between Gremlin and Giraph (or GraphX).

Have a nice day.


Solution

  • Interesting question. I am on the same track.

    First, your question about MLlib. I assume you mean Apache Spark MLlib, the machine learning (ML) library built on top of Apache Spark. So my reading is that you want to run ML algorithms, for purposes such as clustering and classification, on the data in your Titan/Cassandra graph database. Note that you could also use graph processing algorithms, like the PageRank mentioned by spidy, to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.

    Apache Spark MLlib seems to be future-proof and widely supported; its most recent announcements concerned new ML algorithms. Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Apache Mahout has also adopted Apache Spark as an execution backend, which is why I mention it in this post. In addition to in-memory computing, Apache Spark offers MLlib for machine learning, Spark SQL, which is like Hive on Spark, GraphX, which is a graph processing system as explained by spidy, and Spark Streaming for processing streaming data.

    I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets), on top of storage layers such as Cassandra, Hadoop/HCatalog and HBase. A Spark connector for Cassandra is available. Note that RDDs are immutable: you cannot alter an RDD in place, you can only process and analyze the data and derive new RDDs from it. Regarding the RDD as a logical storage layer: you could compare an RDD to a view in the good old SQL days. An RDD gives you a view on, for example, a table in Cassandra or HBase. Note also that Apache Spark offers APIs for three languages: Scala, Java and Python.
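
    The immutability point can be illustrated without a Spark installation. The sketch below is a toy stand-in, not the Spark API: the class `ToyDataset` and the sample rows are hypothetical, and unlike a real RDD it is neither lazy nor distributed. It only shows that each transformation returns a new dataset while the source stays untouched.

```python
class ToyDataset:
    """Hypothetical, Spark-free stand-in for an immutable dataset."""

    def __init__(self, rows):
        # Store rows in a tuple so they cannot be modified in place.
        self._rows = tuple(rows)

    def map(self, fn):
        # Return a NEW dataset; self is left unchanged.
        return ToyDataset(fn(row) for row in self._rows)

    def filter(self, predicate):
        # Return a NEW dataset containing only the matching rows.
        return ToyDataset(row for row in self._rows if predicate(row))

    def collect(self):
        # Materialize the rows as a plain list.
        return list(self._rows)


# Made-up sample data standing in for a table read from a storage layer.
source = ToyDataset([("alice", 34), ("bob", 27), ("carol", 41)])

# Each step derives a new dataset from the previous one.
over_30 = source.filter(lambda row: row[1] > 30)
names = over_30.map(lambda row: row[0])
```

    After these calls, `source.collect()` still returns all three original rows; only `names` holds the derived result.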

    Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark GraphX. Apache Giraph uses Hadoop as its data storage layer. You are using Titan/Cassandra, so you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question about ML using MLlib, and Apache Giraph is not an ML solution.

    Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both work with graph data. Giraph is a solution for graph processing, as spidy explained. Using Giraph you can execute graph analysis algorithms such as PageRank (e.g. who has the most followers), whilst Gremlin is meant for traversing, i.e. querying the graph database by following the complex relationships (edges) between entities (vertices) to obtain result sets of vertex and edge properties.
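
    The distinction between whole-graph processing and traversal can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Giraph or Gremlin code: the `EDGES` data and the function names `pagerank` and `followers_of_followers` are hypothetical. `pagerank` touches every vertex in every iteration, which is the kind of job a graph processing system runs, while `followers_of_followers` starts at one vertex and walks a couple of edges, which is the kind of query a traversal language expresses.

```python
from collections import defaultdict

# Hypothetical toy "follows" graph: an edge (a, b) means a follows b.
EDGES = [("alice", "carol"), ("bob", "carol"), ("dave", "carol"),
         ("carol", "alice"), ("dave", "bob")]


def pagerank(edges, damping=0.85, iterations=50):
    """Whole-graph analysis: every vertex is updated in every iteration."""
    out_edges = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        out_edges[src].append(dst)
        vertices.update((src, dst))
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts each round with the "teleport" share.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_edges.get(v)
            if targets:
                # Split this vertex's rank among the vertices it points to.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A vertex with no outgoing edges spreads its rank evenly.
                share = damping * rank[v] / n
                for t in vertices:
                    new_rank[t] += share
        rank = new_rank
    return rank


def followers_of_followers(edges, person):
    """Traversal: start at one vertex and walk two 'follows' edges backwards."""
    followers = defaultdict(set)
    for src, dst in edges:
        followers[dst].add(src)
    result = set()
    for follower in followers[person]:
        result |= followers[follower]
    result.discard(person)  # do not report the starting vertex itself
    return result


ranks = pagerank(EDGES)
most_followed = max(ranks, key=ranks.get)
second_degree = followers_of_followers(EDGES, "alice")
```

    On this toy data `most_followed` is the vertex with the highest rank across the whole graph, whereas `second_degree` only inspects the neighbourhood of a single starting vertex.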