I thought I was getting the grasp of neo4j. It turns out I am not. I have a long query I'm running. When I run with any 2 of the optional matches it runs in like 20 seconds. But if I had any third optional match (doesn't seem to matter which one) it will take nearly 15 minutes to run. I can't quite figure it out. I understand (somewhat) that optional matches order matters because they take all of the already matched stuff before them and use those "rows" to check the optional match thus getting exponentially more costly with each one. I thought if I added carefully placed "with" statements between each one, I could try to filter only the things that are necessary into each one.
My optional matches don't really have THAT much to do with each other. I'd actually do it as 3-4 different neo4j queries but my boss wants me to do it all in one query. If it turns out that the performance is drastically better, I might end up defying his wishes. I'm going to give you the full query with a few names of things changed. It won't affect the query or anything, my work is technically open source but I'm still not supposed to share anything identifiable.
I also ran "Profile" to show the full tree.
profile
match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)
with ds, firstSample
optional match (ds)<-[*]-(othersample:Sample)
with ds, othersample, firstSample
where not othersample.location is null and not trim(othersample.location) = ''
optional match (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)
with ds, othersample, firstSample, source, specialsample
optional match (ds)<-[*]-(finalsample:Sample)
with ds, othersample, firstSample, source, specialsample, finalsample
where not finalsample.metadata is null and not trim(finalsample.metadata) = ''
return ds.id, collect(distinct firstSample), collect(distinct source), collect(distinct othersample), collect(distinct specialsample), ds.alt_id, ds.status, ds.group_name, ds.group_uuid,
ds.created_timestamp, ds.created_email, ds.last_modified_timestamp, ds.last_modified_email, ds.lab_id, ds.data_types, collect(distinct finalsample)
This is hooking into a python script already written so I don't really have flexibility with the outputs or even the order that they are returned, but if necessary I might be able to do something about it.
Any advice would be appreciated. https://i.sstatic.net/9psgT.png
A couple of things I would try.
Aggregate early in your query instead of waiting until the end.
Use pattern comprehensions instead of OPTIONAL MATCH
where you can.
This might get you started.
match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)
WITH ds,
collect(distinct firstSample) as firstSamples,
collect(distinct source) as sources
UNWIND sources as source
OPTIONAL MATCH (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)
WITH ds,
firstSamples,
collect(distinct source) AS sources,
collect(distinct specialsample) AS specialSamples
RETURN ds.id, ds.alt_id, ds.status, ds.group_name, ds.group_uuid,
ds.created_timestamp, ds.created_email, ds.last_modified_timestamp,
ds.last_modified_email, ds.lab_id, ds.data_types,
firstSamples,
sources,
specialSamples,
apoc.coll.toSet([(ds)<-[*]-(othersample:Sample)
where not othersample.location is null
and not trim(othersample.location) = '' | othersample])
AS otherSamples,
apoc.coll.toSet([(ds)<-[*]-(finalsample:Sample)
where not finalsample.metadata is null
and not trim(finalsample.metadata) = '' | finalsample])
AS finalSamples
You can see in the profile trace that the finalsamples portion of the query involves lots of rows. Are you sure your logic is correct in what you are asking for there?