I have a simple vertex label "url":
schema.vertexLabel('url').partitionKey('url_fingerprint', 'prop1').properties("url_complete").ifNotExists().create()
And an edge label called "links", which connects one url to another:
schema.edgeLabel('links').properties("prop1", 'prop2').connection('url', 'url').ifNotExists().create()
It's possible that one url has millions of incoming links (e.g. the front page of ebay.com from all its subpages).
That seems to produce very large partitions, and DSE crashes because of them (from the OpsCenter wide partitions report): graphdbname.url_e (2284 MB)
How can I avoid that situation? How should I handle these "supernodes"? I found a "partition" command for labels (see the article [1]), but it is deprecated and will be removed in DSE 6.0. The only hint in the release notes is to model the data another way, but I have no idea how to do that in this case.
I'd appreciate any hints. Thanks!
[1] https://www.experoinc.com/post/dse-graph-partitioning-part-2-taming-your-supernodes
The current recommendation is to use the concept of "bucketing" that drives data model design in the C* world and apply that to the graph by creating an intermediary Vertex that represents groups of links.
2 vertex labels (URL and URL_Group)
2 edges (URL -> URL_Group and URL_Group -> URL_Group)
This solution requires bookkeeping to determine when a new group is needed. This could be done through a simple C* table for fast, easy retrieval.
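CREATE TABLE lookup (url_fingerprint text, group int, count counter, PRIMARY KEY (url_fingerprint, group)) WITH CLUSTERING ORDER BY (group DESC);
The column types here are assumptions. Cassandra clusters rows in ascending order by default, so the table declares DESC clustering on group to keep the newest group first in each partition; without that clause, the read needs an explicit ORDER BY group DESC.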
Prior to writing to the Graph, one would need to read the table to find the latest group.
SELECT url_fingerprint, group, count FROM lookup WHERE url_fingerprint = ? ORDER BY group DESC LIMIT 1
If the counter is above roughly 100,000, create a new group (group + 1). During or after writing to the graph, one would need to increment the counter.
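The bookkeeping steps above can be sketched in plain Python. This is only an in-memory stand-in for the lookup table, not driver code: the class name GroupLookup, its method names, and the exact threshold are hypothetical, and a real deployment would also have to deal with concurrent writers racing between the read and the counter increment.

```python
from collections import defaultdict

GROUP_LIMIT = 100_000  # assumed threshold; the text only says "roughly 100,000"


class GroupLookup:
    """Hypothetical in-memory stand-in for the `lookup` counter table."""

    def __init__(self, limit=GROUP_LIMIT):
        self.limit = limit
        # url_fingerprint -> {group number -> number of edges in that group}
        self.counts = defaultdict(dict)

    def latest_group(self, url_fingerprint):
        """Mimic the SELECT: return (group, count) for the highest group, or None."""
        groups = self.counts.get(url_fingerprint)
        if not groups:
            return None
        group = max(groups)
        return group, groups[group]

    def group_for_next_edge(self, url_fingerprint):
        """Choose the group for the next edge, then increment its counter."""
        latest = self.latest_group(url_fingerprint)
        if latest is None:
            group = 0                  # first edge for this URL
        elif latest[1] >= self.limit:
            group = latest[0] + 1      # current group is full, open a new one
        else:
            group = latest[0]          # keep filling the current group
        groups = self.counts[url_fingerprint]
        groups[group] = groups.get(group, 0) + 1
        return group


if __name__ == "__main__":
    lookup = GroupLookup(limit=3)  # tiny limit so the rollover is visible
    print([lookup.group_for_next_edge("front-page") for _ in range(7)])
    # prints [0, 0, 0, 1, 1, 1, 2]
```

The sketch rolls over once a group holds exactly `limit` edges, so no group ever exceeds the limit; whether to use `>` or `>=` is a design choice the text leaves open.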
Traversing would require something similar to:
g.V().has(some url).out(URL).out(URL_Group).in(URL)
Where conceptually you would traverse the relationships like URL -> URL_Group -> URL_Group <- URL
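To make the shape of that traversal concrete, here is a small in-memory sketch in Python. It is not Gremlin and not the DSE Graph API; it assumes one reading of the model, in which each URL owns one or more group vertices and the link edges run between group vertices. The class BucketedGraph and its methods are hypothetical names.

```python
from collections import defaultdict


class BucketedGraph:
    """Hypothetical in-memory model: URL -> URL_Group -> URL_Group <- URL."""

    def __init__(self):
        self.url_groups = defaultdict(list)   # URL -> group vertices it owns
        self.group_links = defaultdict(list)  # group -> groups it links to
        self.group_owner = {}                 # group -> URL that owns it

    def add_group(self, url, group_id):
        """Create the group vertex (url, group_id) if it does not exist yet."""
        key = (url, group_id)
        if key not in self.group_owner:
            self.group_owner[key] = url
            self.url_groups[url].append(key)
        return key

    def add_link(self, source_url, source_group, target_url, target_group):
        """Store a link as an edge between two group vertices."""
        src = self.add_group(source_url, source_group)
        dst = self.add_group(target_url, target_group)
        self.group_links[src].append(dst)

    def linked_urls(self, url):
        """Walk URL -> URL_Group -> URL_Group <- URL and collect target URLs."""
        result = []
        for group in self.url_groups.get(url, []):            # URL -> URL_Group
            for target in self.group_links.get(group, []):    # URL_Group -> URL_Group
                result.append(self.group_owner[target])       # URL_Group <- URL
        return result


if __name__ == "__main__":
    graph = BucketedGraph()
    graph.add_link("a", 0, "front", 0)
    graph.add_link("b", 0, "front", 1)  # second bucket of the busy page
    print(graph.linked_urls("a"), graph.linked_urls("b"))
```

The point of the extra hop is that the busy page's incoming edges are spread over several group vertices, so no single vertex has to hold all of them.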
The visual model of this type of traversal would look like the following diagram