I am fairly new to neo4j and sometimes struggle to understand what's going on under the hood.
The main thing I want to do is avoid double queries. Therefore I need to carry over the already-seen contracts from part 1 to part 2 so that I don't look at them again, and so on for the later parts.
I tried to use UNION instead of COLLECT as described below, but that is much slower overall.
I have a query built out of four parts like this (simplified):
//PART1
MATCH (x1:X) -[:STARTS_AT]-> (somewhere) <-[:STARTS_AT]- (z1:Z)
MATCH (x1) -[:ENDS_AT]-> (somewhereelse) <-[:ENDS_AT]- (z1)
WHERE somewhereconditions
WITH COLLECT(x contract_ids) as already_seen_contracts
, COLLECT(all other stuff of interest from x and z) as taking_over
//PART2
MATCH (x2:X) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (s2:S),
(x2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (e2:S)
path = allshortestpath(s2) CONNECTED (e2)
MATCH (z2:Z) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (r:S),
(z2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (t:S)
WHERE z2 on path
AND x2.contracts not in already_seen_contracts
WITH already_seen_contracts + COLLECT(x contract_ids) as already_seen_contracts
, taking_over + COLLECT(all other stuff of interest from x and z) as taking_over
//PART3 and PART4 similar
UNWIND taking_over AS taking_over_unwind
RETURN stuff from taking_over_unwind
I hope my clumsy attempt at simplifying makes the structure clear. My idea was that the COLLECT in each part creates a kind of container to which I can add the new results from each part, carry them over to the next part, and return all results at the end. The first COLLECT, together with the NOT IN filter, is there to avoid the double queries, which is why I set it up this way in the first place.
This works fine as long as there are results for each part.
Now one part is empty (the others are not) and the whole query returns nothing at all. I expected it to still return the things found in the non-empty parts.
I tried OPTIONAL MATCH, but that's not an option since I explicitly don't want NULLs returned, for example from the second MATCH in part 1.
I have seen a solution using a dummy node, which I would have to add to the COLLECT every time to make sure there is at least one result?
Any ideas on either how to avoid the double queries in the first place, or how to solve the problem above where an empty result makes all other results vanish, are highly appreciated!
Also: why is this happening at all?
Thanks a lot for helping out!!
edited: Sample data:
There are nodes X of one kind and Z of another kind, each with the properties contractid and createdate.
I want to retrieve all of:
x1.contractids, z1.contractids, x1.createdates, z1.createdates
x2.contractids, z2.contractids, x2.createdates, z2.createdates
For the '1s', x and z have the same start and end; for the '2s', z lies along the route of x.
Try profiling your query, like this:
PROFILE MATCH (x1:X) -[:STARTS_AT]-> (somewhere) <-[:STARTS_AT]- (z1:Z)
MATCH (x1) -[:ENDS_AT]-> (somewhereelse) <-[:ENDS_AT]- (z1)
WHERE somewhereconditions
WITH COLLECT(x contract_ids) as already_seen_contracts
, COLLECT(all other stuff of interest from x and z) as taking_over
MATCH (x2:X) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (s2:S),
(x2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (e2:S)
path = allshortestpath(s2) CONNECTED (e2)
MATCH (z2:Z) -[:STARTS_AT]-> (somewhere) -[:IS_IN]-> (r:S),
(z2) -[:ENDS_AT]-> (somewhereelse) -[:IS_IN]-> (t:S)
WHERE z2 on path
AND x2.contracts not in already_seen_contracts
WITH already_seen_contracts + COLLECT(x contract_ids) as already_seen_contracts
, taking_over + COLLECT(all other stuff of interest from x and z) as taking_over
UNWIND taking_over AS taking_over_unwind
RETURN stuff from taking_over_unwind
You will get the query execution plan for the query. Note that internally, neo4j passes data from one stage to the next in the form of rows, so when all the rows get filtered out at one of the stages due to some conditional check, the later stages have nothing to work on, and the result set will be empty.
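The effect can be reproduced without any data. The following is a minimal sketch, not part of the original query: the literal list stands in for the results collected in part 1, and UNWIND [] stands in for a MATCH that finds nothing.

```cypher
// Stand-in for part 1: a single row that carries the collected list.
WITH [1, 2] AS taking_over
// Stand-in for a part whose MATCH finds nothing:
// unwinding an empty list leaves zero rows.
UNWIND [] AS x
// taking_over is the grouping key of this aggregation. With zero input rows
// there are zero groups, so this WITH emits no row and the list is lost.
WITH taking_over, collect(x) AS found
WITH taking_over + found AS taking_over
UNWIND taking_over AS taking_over_unwind
RETURN taking_over_unwind
// Returns no rows, even though taking_over held [1, 2] before the empty stage.
```

Removing the `UNWIND [] AS x` line and the `collect(x)` step makes the same query return 1 and 2, which shows that the empty stage alone is what discards the earlier results.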
In your case, you can break your query into multiple parts. Pass the primary keys or unique identifiers of the nodes that you want to reuse into the other queries, and create indexes on those unique identifiers. This way, performance should not be an issue, and you will get the expected output from each part, which can then be combined at the application level.
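One way the split could look, as a sketch only. The property names contractid and createdate come from the sample data in the question; the parameter name $already_seen, the RETURN column names, and the index syntax (Neo4j 4.x and later; older versions use CREATE INDEX ON :X(contractid)) are assumptions. Each statement is sent separately by the application.

```cypher
// Run once: index the identifier that is passed between the queries.
CREATE INDEX FOR (x:X) ON (x.contractid);

// Query 1: part 1 on its own. If it finds nothing, only this query is empty.
MATCH (x1:X)-[:STARTS_AT]->(somewhere)<-[:STARTS_AT]-(z1:Z)
MATCH (x1)-[:ENDS_AT]->(somewhereelse)<-[:ENDS_AT]-(z1)
// (add the WHERE conditions from part 1 here)
RETURN x1.contractid AS x_contractid, z1.contractid AS z_contractid,
       x1.createdate AS x_createdate, z1.createdate AS z_createdate;

// Query 2: only the filter is shown; the part 2 pattern from the question
// replaces the bare MATCH. $already_seen is a list parameter supplied by
// the application from the x_contractid values of query 1.
MATCH (x2:X)
WHERE NOT x2.contractid IN $already_seen
RETURN x2.contractid AS x_contractid, x2.createdate AS x_createdate;
```

The application would append the contract ids returned by each query to the list it passes as $already_seen to the next one, and concatenate the result rows of all parts at the end. A part that returns nothing then simply contributes no rows.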