Search code examples
pythondatabaseoptimizationneo4jcql

Neo4j trouble optimizing query with multiple optional matches


I thought I was getting the grasp of neo4j. It turns out I am not. I have a long query I'm running. When I run with any 2 of the optional matches it runs in like 20 seconds. But if I had any third optional match (doesn't seem to matter which one) it will take nearly 15 minutes to run. I can't quite figure it out. I understand (somewhat) that optional matches order matters because they take all of the already matched stuff before them and use those "rows" to check the optional match thus getting exponentially more costly with each one. I thought if I added carefully placed "with" statements between each one, I could try to filter only the things that are necessary into each one.

My optional matches don't really have THAT much to do with each other. I'd actually do it as 3-4 different neo4j queries but my boss wants me to do it all in one query. If it turns out that the performance is drastically better, I might end up defying his wishes. I'm going to give you the full query with a few names of things changed. It won't affect the query or anything, my work is technically open source but I'm still not supposed to share anything identifiable.

I also ran "Profile" to show the full tree.

profile
match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)
with ds, firstSample

optional match (ds)<-[*]-(othersample:Sample)
with ds, othersample, firstSample
where not othersample.location is null and not trim(othersample.location) = ''

optional match (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)
with ds, othersample, firstSample, source, specialsample

optional match (ds)<-[*]-(finalsample:Sample)
with ds, othersample, firstSample, source, specialsample, finalsample
where not finalsample.metadata is null and not trim(finalsample.metadata) = ''


return ds.id, collect(distinct firstSample), collect(distinct source), collect(distinct othersample), collect(distinct specialsample), ds.alt_id, ds.status, ds.group_name, ds.group_uuid, 
ds.created_timestamp, ds.created_email, ds.last_modified_timestamp, ds.last_modified_email, ds.lab_id, ds.data_types, collect(distinct finalsample)

This is hooking into a python script already written so I don't really have flexibility with the outputs or even the order that they are returned, but if necessary I might be able to do something about it.

Any advice would be appreciated. https://i.sstatic.net/9psgT.png


Solution

  • A couple of things I would try.

    1. Aggregate early in your query instead of waiting until the end.

    2. Use pattern comprehensions instead of OPTIONAL MATCH where you can.

    This might get you started.

    match (ds:Analysis)<-[:OUTPUT]-(a)<-[:INPUT]-(firstSample:Sample)<-[*]-(source:Source)
    
    WITH ds, 
    collect(distinct firstSample) as firstSamples, 
    collect(distinct source) as sources
    
    UNWIND sources as source
    
    OPTIONAL MATCH (source)-[:INPUT]->(oa)-[:OUTPUT]->(specialsample:Sample {sample_type:'protein'})-[*]->(ds)
    
    WITH ds, 
    firstSamples, 
    collect(distinct source) AS sources, 
    collect(distinct specialsample) AS specialSamples
    
    RETURN ds.id, ds.alt_id, ds.status, ds.group_name, ds.group_uuid, 
    ds.created_timestamp, ds.created_email, ds.last_modified_timestamp, 
    ds.last_modified_email, ds.lab_id, ds.data_types, 
    firstSamples, 
    sources,
    specialSamples,
    
    apoc.coll.toSet([(ds)<-[*]-(othersample:Sample) 
    where not othersample.location is null 
    and not trim(othersample.location) = '' | othersample]) 
    AS otherSamples,
    
    apoc.coll.toSet([(ds)<-[*]-(finalsample:Sample) 
    where not finalsample.metadata is null 
    and not trim(finalsample.metadata) = '' | finalsample]) 
    AS finalSamples
    

    You can see in the profile trace that the finalsamples portion of the query involves lots of rows. Are you sure your logic is correct in what you are asking for there?