Search code examples
amazon-web-servicesgraphtitanbigdata

Using & Scaling Titan Graph Database


I am figuring out my options for storing hierarchical data (parent - child relationships).

Since a tree is a graph and a forest (of trees) is also technically a graph, a graph database seems to fit the bill much better than a RDBMS esp. since I am concerned with optimizing both read and write operations.

  • Optimizing writes implies changes in hierarchy require minimal writes.
  • Optimizing reads implies materializing the full path to a particular node consumers minimal read operations.

My use case is:

  • A tree per user. Should I store and use one graph across the user space or one graph per user?
  • Path queries starting at any node and back to root of tree for a user.
  • Child nodes store links to parent nodes

Since all of my resources are in AWS, being able to use the Titan DynamoDB backend seems ideal.

My real problem is in understanding how to scale and manage Titan though.

  1. Do I need a gremlin server instance? In other words, do I need to stand up a EC2 instance with gremlin server in order to do anything with Titan? Or can I use the Java Titan API to work with graph data directly?

  2. Do I need to explicitly shard the data? In other words, do I need to stand up more gremlin servers as usage increases and the amount of data and the amount of operations increase? When the number of servers scale out, do I need to consistent hash across those servers from the client in order to perform operations?

  3. Do I need to setup an elastic search cluster to be able to start traversals from any node? Or is using vertices to represent objects and edges to represent parent relationships sufficient at this point? I can guarantee that vertex ID's are unique across the user-space ; I can also decorate each vertex with the unique user ID as well. In that case, do I need elastic search? My hope is that elastic search is for free form or more complex search type queries and not for exact queries!

  4. As the number of front-ends increase, can each front-end open the graph (single graph across user space)? If a graph per user, then since front-ends have no affinity, the same graph may be opened for each user; is that OK?

I wasn't able to find much documentation on any of this. Thank you!


Solution

  • I will try to answer your questions in the following:

    1. Both solutions are possible, it is highly depend on your application to decide between choosing gremlin server or having a customized data access layer with customized queries through other secondary data stores. Although I would prefer having customized data access layer, it is possible to response all gremlin query requirements through gremlin server.

    2. Gremlin server is just an interface between your application and data stores, and due to the caching mechanism it is memory-intensive. Data can be stored in different machines for example a cluster of DynamoDB machines. It depend on the number of concurrent users, but I think vertical scaling is more than enough for most of the applications. If you are going to use titan in a highly concurrent environment, beyond resource of single machine, probably you have to create different gremlin-servers on different machines and handle the load balancing mechanism. The problem is you have to control sending requests in a way that similar queries hit the same gremlin-server from the cache efficiency point of view.

    3. Yes, indexing backend is just useful for more complicated queries other than simple retrieval. Secondary index backend like Solr/ Elastic or Lucene is useful if you want to have conditional search or text search by similarity. It is because that indexer like Lucene can provide a reverse index structure that can be helpful for similar searches. If you are going to search for all parents/children with having "foo" in their names you have to use indexing backend. If you are going to search for all parents/children with age less than 40, you have to use indexing backend too. More information about indexing backends could be accessed via these link. http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html http://s3.thinkaurelius.com/docs/titan/1.0.0/index-parameters.html

    4. It is highly recommended to limit the number of open graphs to one for the entire application. Titan uses some caching mechanisms that encourage you to have a single graph instance in the entire application for the sake of performance. Since uncommitted data is just visible on a single graph instance and transaction, if you want real-time application it is suggested that use single graph instance and single transaction. However, using more than 1 graph instance in the entire application for a read-only transaction is not wrong, but it is not efficient.

    You can find lots of information about Titan graph database in the following links:

    Main Titan documentaion: http://s3.thinkaurelius.com/docs/titan/1.0.0/

    An old but really useful document about how Titan works: https://github.com/elffersj/delftswa-aurelius-titan/tree/master/SA-doc