Do we need to index on relationship properties to ensure that Neo4j will not search through all relationships?

To clarify, let's assume that I have a relationship type, "connection". Connections have a property called "typeOfConnection", which can take on values in the domain:

{"GroupConnection", "FriendConnection", "BlahConnect"}.

When I query, I may want to restrict connections to one of these types. While there are not many types, there will be millions of connections of each type.

Do I need to put an index on connection.typeOfConnection to ensure that a query does not have to traverse every connection?

If so, I have been unable to find a simple Cypher statement to do this. I've seen some material in the documentation describing how to do this in Java, but I'm interacting with Neo4j using Py2Neo, so it would be wonderful if there were a Cypher way to do this.

Solution

This is a mixed-granularity property graph data model. That is totally fine, but you need to replace your relationship qualifiers with intermediate nodes. To do this, replace each relationship with one intermediate node that carries the type, plus two relationships, so that the type can be indexed as a node property.

Your model is a graph with coarse-grained granularity. The opposite extreme is fine-grained granularity, which is the foundation of the RDF model. In a property graph, a coarse-grained model like this needs nodes in place of relationships, with each node labeled and holding the connection type as a property.
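The replacement described above can be sketched outside of Neo4j with plain Python data. This is only an illustration of the shape of the change, not a migration script: the Edge tuple, the reify function, and the "conn-N" identifiers are all hypothetical names made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Edge(NamedTuple):
    """A relationship that carries its kind as a property (hypothetical model)."""
    start: str
    end: str
    type_of_connection: str


def reify(edges: List[Edge]) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Turn each qualified edge into one intermediate node plus two plain edges."""
    connection_nodes = []  # (node_id, type) pairs; these are what gets indexed
    links = []             # untyped (from_id, to_id) relationships
    for i, edge in enumerate(edges):
        conn_id = f"conn-{i}"  # hypothetical id scheme for the intermediate node
        connection_nodes.append((conn_id, edge.type_of_connection))
        links.append((edge.start, conn_id))
        links.append((conn_id, edge.end))
    return connection_nodes, links
```

Each original edge becomes exactly one intermediate node and two links, so the type ends up as a node property where a label-and-property index can reach it.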

For instance, let's assume you have:

MATCH (thing1:Thing { id: 1 })-->(group:Connection { type: "group" }),
      (group)-->(thing2:Thing)
RETURN thing2

Then you can create an index on the type property of nodes labeled Connection.

CREATE INDEX ON :Connection(type)

This gives you the flexibility to leave your relationships untyped when your application needs dynamic connection types that rule out a fine-grained model.

Whatever you do, don't work around your issue by dynamically generating typed relationships in your Cypher queries. This will prevent your query templates from being cached and decrease performance. Either type all your relationships or go with the intermediate node I've recommended above.
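As a rough sketch of the alternative, the query text can stay constant while only the parameter values change between calls. The names CONNECTION_QUERY and build_connection_query are hypothetical, and the $name parameter syntax assumes a Neo4j version that supports it (older releases used {name} instead). How the pair is handed to the server depends on your Py2Neo version, so that call is left out.

```python
from typing import Any, Dict, Tuple

# One fixed Cypher template. The connection type is a parameter on the
# intermediate node, not part of the query text.
CONNECTION_QUERY = (
    "MATCH (thing1:Thing { id: $id })-->(c:Connection { type: $type })"
    "-->(thing2:Thing) "
    "RETURN thing2"
)


def build_connection_query(thing_id: int, connection_type: str) -> Tuple[str, Dict[str, Any]]:
    """Return the constant query text together with its parameter map."""
    return CONNECTION_QUERY, {"id": thing_id, "type": connection_type}
```

Because every call yields the identical query string, the server sees a single template regardless of which connection type is requested.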