graph, graph-databases, titan, gremlin

Modeling values in graph DB - vertex or property?


I am modeling a data set in a graph database (Titan 0.5.2 on top of Cassandra). It has entities (represented by vertices) and two kinds of properties: links between entities (naturally represented by edges) and scalar properties (such as strings or numbers). There are many property types (about 2000 now), and each property type is always of the same kind (i.e., property P1 is always a link and property P2 is always a string). However, each entity can have any set of properties, and properties can be repeated (i.e., entity E1 can have three P2 values and no P1 values).

The question is how best to model the scalar values of P2. Should they be part of the entity vertex E1? A property on the edge between entity vertex E1 and a property vertex P2? An edge labeled P2 between E1 and a value vertex containing the actual value? Something else? I am interested mainly in the performance considerations of each solution: is it better to have a lot of properties on vertices, or "thin" vertices but a lot of them and a lot of edges? Is there a difference in how they can be indexed? I'm also interested in other considerations, such as convenience of querying.

The data set contains tens of millions of entities (and will potentially grow, probably to hundreds of millions). Each vertex usually has about 10-20 properties, but some vertices can have many more, e.g. hundreds or more. The anticipated queries could use any property, both the fact that it is present and its value, and may also require calculations like "the greatest P2 value for this entity" or "does this entity have any P2 value which satisfies a certain condition". The querying is planned to be done with Gremlin-type queries, but using Titan-only features is not excluded if it helps.


Solution

  • Personally, I think it is usually most natural to model an entity's scalar values as properties of its vertex. This is even more true when using Titan's new multi-property and meta-property features. Multi-properties are LIST/SET properties (a key can hold several values on one vertex), and meta-properties are properties on a property. Here are the relevant docs that describe this fully:

    http://s3.thinkaurelius.com/docs/titan/0.9.0-SNAPSHOT/advanced-schema.html#_multi_properties

    http://www.tinkerpop.com/docs/3.0.0-SNAPSHOT/#vertex-properties

    You can create vertex-centric indexes on the properties to enable queries like "greatest P2 value for this entity". In terms of performance, this solution should work very well.

    http://s3.thinkaurelius.com/docs/titan/0.5.0-SNAPSHOT/indexes.html#vertex-indexes

    By default Titan will only retrieve the properties you ask for (unless you specifically tell it not to via query.fast-property), and it can do this all within the row of the vertex, so it is fast. The mechanics of that are described here:

    http://s3.thinkaurelius.com/docs/titan/0.9.0-SNAPSHOT/data-model.html

    The one thing you have to watch out for is vertex rows that grow out of control. You mentioned that a vertex might have hundreds of properties, and that sounds fine. If you start to get into the hundreds of thousands, you can run into problems working with the vertex, especially when performing OLAP operations.

    The other thing to watch out for is that edge properties do not have the same features as vertex properties; multi-properties and meta-properties, for example, apply to vertex properties only.
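
To make the recommended shape concrete, here is a minimal sketch of the "values live on the entity vertex" model with repeated (LIST-style) properties. It is a plain Python toy, not Titan's or Gremlin's API: the `Vertex` class and its `add`, `has`, `greatest` and `any_match` methods are hypothetical names invented for illustration, and the values are assumed to be numbers so that "greatest" is well defined. It only shows how the three queries from the question (presence, greatest value, any value matching a condition) map onto this layout. The toy scans the values linearly, whereas in Titan a vertex-centric index would be expected to avoid that scan.

```python
from collections import defaultdict


class Vertex:
    """Toy stand-in for an entity vertex with LIST-cardinality properties."""

    def __init__(self, name):
        self.name = name
        # property key -> list of values (a key may repeat, like P2 on E1)
        self.props = defaultdict(list)

    def add(self, key, value):
        """Attach one more value under the given property key."""
        self.props[key].append(value)

    def has(self, key):
        """True if the vertex carries at least one value for the key."""
        return len(self.props.get(key, ())) > 0

    def greatest(self, key):
        """Greatest value for the key, or None if the key is absent."""
        values = self.props.get(key)
        return max(values) if values else None

    def any_match(self, key, predicate):
        """True if any value under the key satisfies the predicate."""
        return any(predicate(v) for v in self.props.get(key, ()))


# Entity E1 has three P2 values and no P1 values, as in the question.
e1 = Vertex("E1")
for value in (3, 17, 8):
    e1.add("P2", value)

print(e1.has("P1"))                          # presence check
print(e1.greatest("P2"))                     # "greatest P2 value for this entity"
print(e1.any_match("P2", lambda v: v > 10))  # "any P2 value satisfying a condition"
```

All three queries touch only the one vertex, which is the property the answer relies on when it says the lookup can be served from within the vertex's row.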