Search code examples
neo4jcypherneo4j-apoc

Index queries worsen query performance


I tried to use indexes on one of my queries to make it run faster, but it makes another query run slower. So I have one query for the data processing? which run faster with indexes and another one for data production from Neo4j (with Kafka) and started to run slower with indexes. I performed index creation before the data processing and its removal when I needed to produce the data, but it was not an effective technique. How can Neo4j query actually run slower with indexes?

Here is the query for the data production:

 MATCH (m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic), (t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->(e:Event)-[a_r:AT]->(v:Venue) 
WHERE mt.topic_id = gt.topic_id AND distance(point({ longitude: m.lon, latitude: m.lat}),point({ longitude: v.lon, latitude: v.lat })) < 4000 
RETURN distinct mt.member_id as member_id, m.lat as member_lat, m.lon as member_lon

Performance of the query without indexes:

enter image description here

6432058 total db hits in 1888 ms

Profile plan for the query with indexes:

enter image description here

138425061 total db hits in 149617 ms

Index queries look like this:

  CREATE INDEX ON:MemberTopics(member_id)

  CREATE INDEX ON:MemberTopics(topic_id)

  CREATE INDEX ON:GroupTopics(topic_id)

Solution

  • The reason the indexed query performs worse, is because the Cypher planner was wrong about the amount of work it was committing to (Estimated row count vs actual row count).

    With no index, the planner knows it will have to do a label scan of MemberTopics, and it's internal statistics say that will come to the tune of ~3 million rows (and it is right about that). With an index on Membertopics, the planner's internal statistics said it would probably need ~2k rows if it used the index, and the actual outcome was ~70 million rows... oops! This is partially because the planner underestimated how many rows it would be working with by this point of the query, but I can't be sure exactly why it was off by so many orders of magnitude. If you remove the Member node from the cypher and change the return to RETURN *, it looks like you would get the answer to that question though.


    This is partially because your cypher/data itself is a monster to plan for.

    • You match 2 Topic nodes (t and t1) that are not used elsewhere
    • WHERE mt.topic_id = gt.topic_id is a foreign key reference, and should never appear in a cypher; and really is the lynch pin in why your performance is terrible. Either topic_id should be it's own node, or you should have a direct relationship between mt and gt. Neo4j is really good at walking relationship edges. It's terrible at foreign keys (compared to walking along relations).
      • You have 2 long paths, with no clear indication of how they are related (from the Cypher planner's view).
    • Your MATCH is a pair of long path chains with no clear start point. Long paths that don't lock one of the nodes down to a specific (few) node(s) are almost guaranteed to explode your query row count, and thus your performance.

    As for what you can do about it,

    • Limit the scope of your query

    to a single member, or group, or area

    • Split your query into stages using WITH

    With creates a logical partition of the Cypher, and the planner will mostly try to resolve everything before the WITH before continuing with the rest of the logic. You know your data better than the planner, so this is a good way to limit the scope of your query without changing the results. (In your case, probably start with just Member and Venue, and filter from there.)

    • Don't use foreign keys

    Replace mt.topic_id = gt.topic_id with a relationship or intermediate node.