neo4j cypher collect

neo4j cypher with different 'parts' returns nothing if one part is empty


I am fairly new to neo4j and sometimes struggle to understand what is going on under the hood.
The main thing I want to do is avoid querying the same contracts twice. To do that, I need to carry the contracts already seen in part 1 over to part 2 so that they are not looked at again, and so on for the later parts.
I tried using UNION instead of COLLECT as described below, but that is much slower overall.
I have a query built out of four parts like this (simplified):

//PART1
MATCH (x1:X) -[:STARTS_AT]-> (somewhere) <-[:STARTS_AT]- (z1:Z)
MATCH 
   (x1) -[:ENDS_AT]-> (somewhereelse) <-[:ENDS_AT]- (z1)
WHERE somewhereconditions
WITH COLLECT(x contract_ids) as already_seen_contracts
   , COLLECT(all other stuff of interested from x and z) as taking_over

//PART2
MATCH (x2:X) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (s2:S),
      (x2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (e2:S)
      path = allshortestpath(s2) CONNECTED (e2)

MATCH (z2:Z) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (r:S),
        (z2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (t:S) 
WHERE z2 on path
AND x2.contracts not in already_seen_contracts

WITH already_seen_contracts + COLLECT(x contract_ids) as already_seen_contracts
   , taking_over +  COLLECT(all other stuff of interested from x and z) as taking_over

//PART3 and PART4 similar

UNWIND taking_over AS taking_over_unwind 
RETURN stuff from taking_over_unwind 

I hope my clumsy attempt at simplifying makes the structure clear. The idea is that the COLLECT in each part creates a kind of container: each part adds its new results to it (the first COLLECT, together with the NOT IN filter, avoids querying the same contracts twice, which is why I set it up this way in the first place), the container is carried over to the next part, and all results are returned at the end.
This works fine as long as every part has results.
Now one part is empty (the others are not), and the whole query returns nothing at all. I expected it to still return the results found in the non-empty parts.

I tried OPTIONAL MATCH, but that is not an option because I explicitly don't want NULLs returned, for example from the second MATCH in part 1.

I have seen a solution that uses a dummy node, which I would have to add to the COLLECT every time to make sure there is at least one result?

Any ideas on how to either avoid the double queries in the first place, or fix the setup above where one empty part makes all results vanish, are highly appreciated!

Also: why is this happening at all?

Thanks a lot for helping out!!

Edit: sample data. There are nodes X of one kind and Z of another kind, both with the properties contractid and createdate.
I want to get all of the following:

x1.contractids, z1.contractids, x1.createdates, z1.createdates
x2.contractids, z2.contractids, x2.createdates, z2.createdates

For the '1s', x and z have the same start and end; for the '2s', z lies along the route of x.


Solution

  • Try profiling your query, like this:

    PROFILE MATCH (x1:X) -[:STARTS_AT]-> (somewhere) <-[:STARTS_AT]- (z1:Z)
    MATCH 
       (x1) -[:ENDS_AT]-> (somewhereelse) <-[:ENDS_AT]- (z1)
    WHERE somewhereconditions
    WITH COLLECT(x contract_ids) as already_seen_contracts
       , COLLECT(all other stuff of interested from x and z) as taking_over
    
    MATCH (x2:X) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (s2:S),
          (x2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (e2:S)
          path = allshortestpath(s2) CONNECTED (e2)
    
    MATCH (z2:Z) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (r:S),
            (z2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (t:S) 
    WHERE z2 on path
    AND x2.contracts not in already_seen_contracts
    
    WITH already_seen_contracts + COLLECT(x contract_ids) as already_seen_contracts
       , taking_over +  COLLECT(all other stuff of interested from x and z) as taking_over
    
    UNWIND taking_over AS taking_over_unwind 
    RETURN stuff from taking_over_unwind 
    

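    You will get the query execution plan for the query. Note that internally, neo4j passes data from one stage to the next in the form of rows. When all rows are filtered out at one stage, for example by a WHERE condition or a MATCH that finds nothing, the later stages have no rows to work on and the result set is empty. This also applies to the WITH that follows the empty part: it carries the earlier lists along as grouping keys next to COLLECT, and an aggregation with grouping keys returns no row when it receives no rows, so the results collected in earlier parts are lost as well.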
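
    The row behaviour can be reproduced without any data. The sketch below is only an illustration: `UNWIND []` stands in for a MATCH that finds nothing, and the names `seen` and `more` are made up for the example. The third statement shows one possible way to keep the earlier results inside a single query, assuming a Neo4j version that supports `CALL { ... }` subqueries (4.1 or later); it is not taken from the original query.

    ```cypher
    // 1) Aggregation without a grouping key: zero input rows still give one row.
    UNWIND [] AS n
    WITH collect(n) AS seen
    RETURN seen;                 // one row, seen = []

    // 2) Aggregation with a grouping key (seen): zero input rows give zero rows,
    //    so the list built earlier disappears together with the row.
    WITH [1, 2] AS seen
    UNWIND [] AS n               // stands in for an empty PART2
    WITH seen, collect(n) AS more
    RETURN seen + more AS seen;  // no rows at all

    // 3) Possible workaround: aggregate inside a subquery, where collect()
    //    has no grouping key and therefore always returns one row.
    WITH [1, 2] AS seen
    CALL {
      WITH seen                  // import the outer variable
      UNWIND [] AS n             // stands in for an empty PART2
      RETURN collect(n) AS more
    }
    RETURN seen + more AS seen;  // one row, seen = [1, 2]
    ```

    In the real query, the MATCH and WHERE clauses of each part would take the place of `UNWIND []` inside the subquery.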

    In your case, you can break the query into multiple separate queries. Pass the primary keys or unique identifiers of the nodes you want to reuse into the later queries, and create indexes on those unique identifiers. That way performance should not be an issue, and you get the expected output from each part, which can then be combined at the application level.
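
    A minimal sketch of that split, under some assumptions: `contractid` and `createdate` are the property names from the sample data in the question, the index name `x_contractid` and the parameter `$already_seen` are hypothetical, and the index syntax is the one used by Neo4j 4.x and later. The path conditions of part 2 are left out because the question only gives them as pseudocode.

    ```cypher
    // Index on the identifier that is passed between the queries.
    CREATE INDEX x_contractid IF NOT EXISTS FOR (x:X) ON (x.contractid);

    // Query 1: same start and end for x and z. The application keeps the
    // returned x_contract values as its "already seen" list.
    MATCH (x1:X)-[:STARTS_AT]->(s)<-[:STARTS_AT]-(z1:Z)
    MATCH (x1)-[:ENDS_AT]->(e)<-[:ENDS_AT]-(z1)
    RETURN x1.contractid AS x_contract, z1.contractid AS z_contract,
           x1.createdate AS x_created,  z1.createdate AS z_created;

    // Query 2 (skeleton): the list from query 1 comes back in as a parameter.
    // If this query finds nothing, the results of query 1 are unaffected.
    MATCH (x2:X)
    WHERE NOT x2.contractid IN $already_seen
    RETURN x2.contractid AS x_contract, x2.createdate AS x_created;
    ```

    Each query returns its own rows, so an empty part simply contributes nothing when the application concatenates the results.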