Tags: azure-cosmosdb, gremlin, graph-databases, azure-cosmosdb-gremlinapi

Can I have O(1000s) of vertices connecting to a single vertex and O(1000s) of properties off a vertex for Cosmos DB and/or graph databases?


I have a graph with the following pattern:

- Workflow:

-- Step #1
--- Step execution #1
--- Step execution #2
    [...]
--- Step execution #n

-- Step #2
--- Step execution #1
--- Step execution #2
    [...]
--- Step execution #n

[...]

-- Step #m
--- Step execution #1
--- Step execution #2
    [...]
--- Step execution #n

I have a couple of design questions here:

  1. How many execution documents can hang off a single vertex without affecting performance? For example, each "step" could have hundreds of 'executions' off it. I'm using two edges to connect them—'has_runs' (from step → execution) and 'execution_step' (from execution → step).

    Are graph databases (Cosmos DB or any graph database) designed to handle thousands of vertices and edges associated with a single vertex?

  2. Each 'execution' has (theoretically) unlimited properties associated with it, but in practice it is probably 10 < x < 100 properties. Is that OK? Are graph databases made to support such a large number of properties off a vertex?

    All the demos I've seen seem to have < 10 total properties.


Solution

  • Is it appropriate to have so many execution documents hanging off a single vertex? E.g. each "step" could have 100s of 'executions' off it.

    Having hundreds of edges on a single vertex is not atypical and sounds reasonable. In practice, a model can easily grow to the point where a single vertex has millions of edges, which is the supernode problem. At that point you need to make design choices to deal with it, based on your expected query patterns.
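
To make the scale concrete, here is a minimal sketch in plain Python (no graph database involved) that builds the step/execution pattern from the question and counts edges per vertex. The 'has_runs' and 'execution_step' labels come from the question; the 'has_step' label, the function names, and the thresholds are hypothetical and only there for illustration.

```python
from collections import Counter


def build_workflow_edges(steps, executions_per_step):
    """Return (source, label, target) triples for the step/execution pattern."""
    edges = []
    for s in range(1, steps + 1):
        step = f"step-{s}"
        # 'has_step' is a hypothetical label; the question does not name this edge.
        edges.append(("workflow", "has_step", step))
        for e in range(1, executions_per_step + 1):
            execution = f"{step}/execution-{e}"
            # The two edge labels described in the question.
            edges.append((step, "has_runs", execution))
            edges.append((execution, "execution_step", step))
    return edges


def degree_by_vertex(edges):
    """Count incident edges (incoming plus outgoing) for every vertex."""
    degree = Counter()
    for source, _label, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def find_supernodes(edges, threshold):
    """Return the vertices whose degree is at least `threshold`, sorted by name."""
    return sorted(v for v, d in degree_by_vertex(edges).items() if d >= threshold)


edges = build_workflow_edges(steps=3, executions_per_step=500)
degrees = degree_by_vertex(edges)
candidates = find_supernodes(edges, threshold=1_000)
```

With 500 executions per step, each step vertex carries roughly a thousand edges, which is still far from the millions where supernodes usually become a concern. Where to set the threshold is an assumption that depends on the graph system and the queries you run.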

  • Each 'execution' has (theoretically) unlimited properties associated with it, but is probably 10 < x < 100 properties. Is that ok? Are graph databases made to support many, many properties off a vertex?

    When designing a schema, I think graph modelers tend to think of graph elements (i.e. vertices/edges) as able to hold unlimited properties, but in practice they have to consider the capabilities of the graph system and not assume all systems are the same. Some graphs, like TinkerGraph, will be limited only by available memory. Other graphs, like JanusGraph, will be limited by the underlying data store (e.g. Cassandra, HBase, etc.).

    I'm not aware of any graph system that would have trouble storing 100 properties. Of course, there are caveats to all such generalities. A few examples:

    1. Storing 100 separate simple primitive properties (integers and Booleans) is different from storing 100 byte arrays, each holding 100 megabytes of data.
    2. Storing 100 properties is fine on most systems, but do you intend to index all 100? On some systems that might be an issue. Since you tagged your question with "CosmosDB", I will offer that I don't think they are too worried about that since they auto-index everything.
    3. If any of those 100 properties are multi-properties, you could put yourself in a position to create a different sort of supernode: a fat vertex (a vertex with millions of properties).
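
The first and third caveats can be illustrated with a rough sketch in plain Python. It assumes a simplified model in which every property key maps to a list of values (so a multi-property is just a key with many values); it is not how Cosmos DB or any other graph system actually stores data. The function names are hypothetical, and the byte arrays are scaled down to 1 KiB each so the example runs quickly.

```python
import json


def estimate_payload_bytes(properties):
    """Rough size estimate: UTF-8 key bytes plus the bytes of every value."""
    total = 0
    for key, values in properties.items():
        total += len(key.encode("utf-8"))
        for value in values:
            if isinstance(value, (bytes, bytearray)):
                total += len(value)  # raw binary payload
            else:
                total += len(json.dumps(value).encode("utf-8"))
    return total


def count_property_values(properties):
    """Count stored values; a multi-property contributes one entry per value."""
    return sum(len(values) for values in properties.values())


# 100 simple Boolean properties, one value each.
primitives = {f"flag_{i}": [i % 2 == 0] for i in range(100)}

# 100 byte-array properties, scaled down to 1 KiB each for this sketch.
blobs = {f"blob_{i}": [bytes(1024)] for i in range(100)}

# One multi-property key holding 10,000 values: a small "fat vertex".
multi = {"tag": [f"t{i}" for i in range(10_000)]}

primitive_size = estimate_payload_bytes(primitives)
blob_size = estimate_payload_bytes(blobs)
multi_values = count_property_values(multi)
```

Both `primitives` and `blobs` have exactly 100 property keys, yet their estimated payloads differ by well over an order of magnitude, and `multi` shows how a single key can hide thousands of stored values. Counting keys alone therefore says little about how heavy a vertex really is.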

    All that said, generally speaking, your schema sounds reasonable for any graph system out there.