Search code examples
gremlinjanusgraph

Efficient count of incoming edges Janusgraph


Goal:

Sort the vertices of a given label by the descending number of incoming edges of a given type in the most efficient way.

Configuration:

  • Ubuntu VM, 23Go RAM
  • JanusGraph 0.6.1 full
  • local graph (default conf/remote.yaml file used)
  • ~400k vertices (~2k with the label)
  • ~1.5m relationships (~1m with the type)
  • some indexing work has been done, nothing on the relationships though

What I am doing:

g.V().hasLabel(<label>).
    order().
        by(inE(<type>).count(),desc).
    limit(10).
    project("name","score").
        by(<property_name>).
        by(inE(<type>).count())

Issue:

While this query gives the expected results, it is really slow (up to 7+ minutes) and this execution time is not affordable. Is there a way to improve it? Whether it is an improvement to the query itself, or adding an index somewhere that could help...

What I have seen:

  • I have looked into composite indexes and mixed indexes and they seem to not affect my problem
  • I have considered the vertex-centric indexes: I think it wouldn't be useful here as I don't want a subset of the incoming edges of type type, but the total number of them. Would it still be a beneficial side-effect (assuming I understand them correctly) of the vertex-centric indexes?

Thanks to everyone reading/replying!

Solutions:

  • Simply changing the query to the one as rewritten in @KelvinLawrence's answer makes it quite a bit faster! Without any other changes, the execution time went from 7+ minutes to ~2 minutes.
  • Once the vertex_label property has been added and indexed on, and changing the hasLabel(<label>) into has("vertex_label", <label>), the execution time went down again from 2 minutes to ~5 seconds. The vertex_label property doesn't have any other use but to enable indexing on the type of vertex, JanusGraph not supporting indexing on labels, it provides a workaround.

Solution

  • It seems that perhaps your query is doing quite a bit of repeated work there. What about something like this:

    g.V().hasLabel('<label>').
          group().
            by('<key>').
            by(inE('<type>').count()).
          order(local).
            by(values,desc).
          unfold().
          limit(10)
    

    From memory, I don't think JanusGraph supports indexing on a label which may be part of the problem here. If so, storing the label also as a property, and creating an index on that property may help the initial finding of the 2K vertices.

    UPDATED 2022-06-21 To show an actual example.

    Using the air-routes data set, the query might look like this:

    gremlin> g.V().hasLabel('airport').
    ......1>       group().
    ......2>         by('code').
    ......3>         by(inE('route').count()).
    ......4>       order(local).
    ......5>         by(values,desc).
    ......6>       unfold().
    ......7>       limit(10)   
      
    ==>FRA=307
    ==>IST=307
    ==>CDG=294
    ==>AMS=284
    ==>MUC=271
    ==>ORD=263
    ==>DFW=251
    ==>PEK=249
    ==>DXB=247
    ==>ATL=242