How can I get the same results from Amazon Neptune that I do from Cosmos DB?


Using Gremlin.Net 3.3.2, I am getting very different results from Neptune and Cosmos DB. The graph data is the same on both platforms. Cosmos DB gives me everything I need (vertex id, label and properties).

When I run the same query against Neptune using Gremlin.Net, I only get the vertex id and label. Is this a bug in the combination of Neptune and Gremlin.Net, or a bug in Neptune itself?

If I execute the query in the Gremlin Console, Neptune returns all the data, so the problem appears to be confined to Gremlin.Net.

query = g.V().has('name',within('wind'))

Amazon Neptune results
{
  "Id": "14b15642-842f-888a-a28e-3ed117a07d5b",
  "Label": "keyword"
}

Cosmos DB results
{
  "id": "wind",
  "label": "keyword",
  "type": "vertex",
  "properties": {
    "popularity": [
      {
        "id": "8f9967f1-cead-41d6-a432-de025d9dc14b",
        "value": "16"
      }
    ],
    "name": [
      {
        "id": "fb90af3f-828b-4cc0-b9f8-b571a30c6b14",
        "value": "wind"
      }
    ]
  }
}

Solution

  • Neptune is more in line with the output that TinkerPop itself would provide, whereas Cosmos DB follows an older approach. TinkerPop recommends returning "references" to graph elements (i.e. id and label, but no properties), and that appears to be what Neptune provides. I don't know if Neptune can be configured to behave differently.

    While it may not seem convenient, the reason TinkerPop recommends this approach is that users should only return the data that they request. For instance, you typically wouldn't do SELECT * FROM table for a SQL query - you would include the fields that you wanted returned in the SELECT clause. For the same reasons you do that in SQL, you would do that in Gremlin.

    Also, returning all properties on an element could be massively expensive. It's hard for TinkerPop to recommend returning anything other than a reference because of multi-properties. If a Vertex can hold millions of properties, the last thing we'd want is for the element to be serialized with all of those properties by default.

    Unfortunately, much of this thinking was not clear in the TinkerPop community when we started down the path of defining IO formats. OLAP was still a bit of an experiment, GLVs were not yet a thought, and so on, so the idea of "reference elements as a default" didn't emerge until later releases. Hopefully we can make the IO more consistent for TinkerPop 4.x some day.

    The way to get the same results would be to follow TinkerPop's recommendations and avoid returning graph elements. The best approach would probably be to use project() or valueMap() in some form:

    g.V().valueMap('popularity','name')
    g.V().
      project('popularity','name').
        by('popularity').
        by('name')
    

    Note that while project() is a bit more verbose in the example, it produces more compact output because it doesn't wrap the value of each key in a List the way valueMap() does. Both forms coerce results to a Map, so they will be consistent across all platforms.
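
    As a rough sketch, the same idea applied to the query from the question might look like the following. It assumes every matched vertex has both a popularity and a name property, and that both providers accept T.id and T.label as by() modulators. I have only reasoned about this against TinkerPop itself, not tested it on Neptune or Cosmos DB.

    ```groovy
    // Filter exactly as in the original query, then return a Map
    // instead of the vertex itself.
    g.V().has('name', within('wind')).
      project('id', 'label', 'popularity', 'name').
        by(T.id).            // vertex id
        by(T.label).         // vertex label
        by('popularity').    // single property value (assumed present)
        by('name')           // single property value (assumed present)
    ```

    Since the result is a plain Map rather than a Vertex, the id and label have to be requested explicitly, like any other field.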