I tried to use indexes on one of my queries to make it run faster, but it makes another query run slower. So I have one query for the data processing? which run faster with indexes and another one for data production from Neo4j (with Kafka) and started to run slower with indexes. I performed index creation before the data processing and its removal when I needed to produce the data, but it was not an effective technique. How can Neo4j query actually run slower with indexes?
Here is the query for the data production:
MATCH (m:Member)-[mtg_r:MT_TO_MEMBER]->(mt:MemberTopics)-[mtt_r:MT_TO_TOPIC]->(t:Topic), (t1:Topic)-[tt_r:GT_TO_TOPIC]->(gt:GroupTopics)-[tg_r:GT_TO_GROUP]->(g:Group)-[h_r:HAS]->(e:Event)-[a_r:AT]->(v:Venue)
WHERE mt.topic_id = gt.topic_id AND distance(point({ longitude: m.lon, latitude: m.lat}),point({ longitude: v.lon, latitude: v.lat })) < 4000
RETURN distinct mt.member_id as member_id, m.lat as member_lat, m.lon as member_lon
Performance of the query without indexes:
6432058 total db hits in 1888 ms
Profile plan for the query with indexes:
138425061 total db hits in 149617 ms
Index queries look like this:
CREATE INDEX ON:MemberTopics(member_id)
CREATE INDEX ON:MemberTopics(topic_id)
CREATE INDEX ON:GroupTopics(topic_id)
The reason the indexed query performs worse, is because the Cypher planner was wrong about the amount of work it was committing to (Estimated row count vs actual row count).
With no index, the planner knows it will have to do a label scan of MemberTopics, and it's internal statistics say that will come to the tune of ~3 million rows (and it is right about that). With an index on Membertopics, the planner's internal statistics said it would probably need ~2k rows if it used the index, and the actual outcome was ~70 million rows... oops! This is partially because the planner underestimated how many rows it would be working with by this point of the query, but I can't be sure exactly why it was off by so many orders of magnitude. If you remove the Member node from the cypher and change the return to RETURN *
, it looks like you would get the answer to that question though.
This is partially because your cypher/data itself is a monster to plan for.
WHERE mt.topic_id = gt.topic_id
is a foreign key reference, and should never appear in a cypher; and really is the lynch pin in why your performance is terrible. Either topic_id should be it's own node, or you should have a direct relationship between mt and gt. Neo4j is really good at walking relationship edges. It's terrible at foreign keys (compared to walking along relations).
As for what you can do about it,
to a single member, or group, or area
With creates a logical partition of the Cypher, and the planner will mostly try to resolve everything before the WITH before continuing with the rest of the logic. You know your data better than the planner, so this is a good way to limit the scope of your query without changing the results. (In your case, probably start with just Member and Venue, and filter from there.)
Replace mt.topic_id = gt.topic_id
with a relationship or intermediate node.