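I have the following setup: Blob nodes, each with a size_in_mb property, are referenced by Tag nodes, and every Tag belongs to exactly one Repository.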
(If this sounds familiar, that is how a Docker registry stores data on disk.)
What do I want to achieve?
I want to have the sum of size_in_mb for each repository, but count each blob only once despite the fact that it can be referenced by many tags.
Consider the following example:
CREATE (b1:Blob {name: 'b1', size_in_mb: 1000})
CREATE (b2:Blob {name: 'b2', size_in_mb: 100})
CREATE (r1:Repository {name: 'r1'})
CREATE (r2:Repository {name: 'r2'})
CREATE (t1:Tag {name: 'r1:latest'})
CREATE (t2:Tag {name: 'r1:old'})
CREATE (t3:Tag {name: 'r2:latest'})
CREATE (t1)-[:TAG_OF]->(r1)
CREATE (t2)-[:TAG_OF]->(r1)
CREATE (t3)-[:TAG_OF]->(r2)
CREATE (b1)-[:TAGGED_BY]->(t1)
CREATE (b1)-[:TAGGED_BY]->(t2)
CREATE (b1)-[:TAGGED_BY]->(t3)
CREATE (b2)-[:TAGGED_BY]->(t2)
The whole graph can be displayed with
MATCH (r:Repository)<--(t:Tag)<--(b:Blob)
RETURN r, t, b
A simple sum
MATCH (r:Repository)<--(t:Tag)<--(b:Blob)
RETURN r.name, sum(b.size_in_mb)
returns
r.name sum(b.size_in_mb)
"r1" 2100
"r2" 1000
but I want to have
r.name sum(b.size_in_mb)
"r1" 1100
"r2" 1000
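because blobs b1 and b2 should each be counted only once for repository r1. The simple sum counts b1 twice for r1, since b1 is reachable through both r1:latest and r1:old (1000 + 1000 + 100 = 2100).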
How should I phrase my Cypher query to reach that goal?
I think I got it, based on Michael's answer:
MATCH (r:Repository)<--(t:Tag)<--(b:Blob)
WITH r, collect(DISTINCT b) AS distinctBlobs
RETURN r.name, reduce(totalSum = 0, n IN distinctBlobs | totalSum + n.size_in_mb) AS size_sum
I am not sure whether this is the optimal solution, but it does produce the correct values.
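A possibly simpler variant, offered only as a sketch and not benchmarked: reduce the matched paths to distinct (repository, blob) pairs with WITH DISTINCT, then use the built-in sum() aggregate in place of collect() plus reduce(). It assumes the same labels, relationship types, and size_in_mb property as the example above.

```cypher
// Collapse the matched paths to distinct (repository, blob) pairs first,
// so a blob reached through several tags of the same repository appears once.
MATCH (r:Repository)<-[:TAG_OF]-(:Tag)<-[:TAGGED_BY]-(b:Blob)
WITH DISTINCT r, b
// Aggregate per repository; r.name acts as the grouping key.
RETURN r.name, sum(b.size_in_mb) AS size_sum
```

On the example data this should give 1100 for r1 and 1000 for r2. Note that sum(DISTINCT b.size_in_mb) would not be a safe shortcut, because it deduplicates size values and not blobs, so two different blobs of the same size would be counted only once.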