neo4j cypher

Sum a numeric property of leaves, counting each leaf only once despite multiple paths


I have the following setup:

  • I have multiple repositories.
  • A repository has many tags.
  • A tag points to many blobs.
  • Different tags can point to the same blob; hence, a blob can be referenced by many tags.
  • A blob has the property size_in_mb which is a numeric value.

(If this sounds familiar, that is how a Docker registry stores data on disk.)

What do I want to achieve?

I want the sum of size_in_mb for each repository, counting each blob only once even when several tags of that repository reference it.

Consider the following example:

CREATE (b1:Blob {name:"b1", size_in_mb: 1000})
CREATE (b2:Blob {name: "b2", size_in_mb: 100})
CREATE (r1:Repository {name: 'r1'})
CREATE (r2:Repository {name: 'r2'})
CREATE (t1:Tag {name: 'r1:latest'})
CREATE (t2:Tag {name: 'r1:old'})
CREATE (t3:Tag {name: 'r2:latest'})
CREATE (t1)-[:TAG_OF]->(r1)
CREATE (t2)-[:TAG_OF]->(r1)
CREATE (t3)-[:TAG_OF]->(r2)
CREATE (b1)-[:TAGGED_BY]->(t1)
CREATE (b1)-[:TAGGED_BY]->(t2)
CREATE (b1)-[:TAGGED_BY]->(t3)
CREATE (b2)-[:TAGGED_BY]->(t2)

We have

MATCH(r:Repository)<--(t:Tag)<--(b:Blob)
RETURN r,t,b

[Image: graph visualization of the query result]

A simple sum

MATCH(r:Repository)<--(t:Tag)<--(b:Blob)
RETURN r.name, sum(b.size_in_mb)

returns

r.name  sum(b.size_in_mb)
"r1"    2100
"r2"    1000

but I want to have

r.name  sum(b.size_in_mb)
"r1"    1100
"r2"    1000

because blobs b1 and b2 should each be counted only once for repository r1, even though b1 is referenced by both of r1's tags.
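
To see where the extra 1000 comes from, it can help to list one row per matched path instead of aggregating. The following is only an illustrative sketch against the example data above; the column aliases are made up for readability:

```cypher
// One row per (repository, tag, blob) path.
// b1 appears twice for r1 because both r1:latest and r1:old point to it.
MATCH (r:Repository)<-[:TAG_OF]-(t:Tag)<-[:TAGGED_BY]-(b:Blob)
RETURN r.name AS repo, t.name AS tag, b.name AS blob, b.size_in_mb AS size_in_mb
ORDER BY repo, tag, blob
```

For r1 this yields three rows (b1 via r1:latest, b1 via r1:old, b2 via r1:old), so sum() adds 1000 + 1000 + 100 = 2100. The aggregation runs over paths, not over distinct blobs.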

How should I phrase my Cypher query to reach that goal?


Solution

  • I think I got it, based on Michael's answer:

    MATCH (r:Repository)<--(t:Tag)<--(b:Blob)
    WITH r, collect(DISTINCT b) AS distinctBlobs
    RETURN r.name, reduce(totalSum = 0, n IN distinctBlobs | totalSum + n.size_in_mb) AS size_sum
    

    Not sure if this is the optimal solution, but it does produce the correct values.
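
    A possible variation, offered as a sketch rather than a verified optimum: deduplicate the (repository, blob) pairs with WITH DISTINCT first, then use the built-in sum(). The relationship types are taken from the example data above. Note that sum(DISTINCT b.size_in_mb) would not be equivalent, because it deduplicates values rather than nodes, so two different blobs of the same size would be counted only once.

    ```cypher
    // Collapse the paths to unique (repository, blob) pairs, then aggregate.
    MATCH (r:Repository)<-[:TAG_OF]-(:Tag)<-[:TAGGED_BY]-(b:Blob)
    WITH DISTINCT r, b
    RETURN r.name, sum(b.size_in_mb) AS size_sum
    ```

    On the example data this gives 1100 for r1 and 1000 for r2, the same values as the collect/reduce query.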