Optimizing a Cypher query with multiple links in Neo4j (faceted search?)


Before going further, here is a representation of my data model. I am stuck for the moment with Neo4j 1.9.2 and have a rather big database (about 1 million nodes as far as I can tell, maybe fewer, but it will grow over time as all the data is ingested). With that in mind, let me explain what I mean by faceted search.

My items (documentaryUnit) are sometimes linked to keywords (which can have different types). What I want to implement is a way to select a few keywords and check whether any node is connected to keyword1, keyword2, etc. I don't want what faceted search is mainly about, i.e. showing the number of results for each remaining option and disabling the options that would return 0 results. I just want to be able to run this "simple" query. Keep in mind that I am quite new to the Neo4j world. I tried to find an answer beforehand, but as I am lacking some of the concepts, I may have missed the right post.

Here is the query I tried:

    START
        facet1 = node:entities("__ID__:keyword-104"),
        facet2 = node:entities("__ID__:place-1"),
        facet3 = node:entities("__ID__:keyword-2"),
        facet4 = node:entities("__ID__:keyword-258")
    MATCH
        (elem)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->(facet1),
        (elem)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->(facet2),
        (elem)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->(facet3),
        (elem)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->(facet4)
    WITH distinct elem, facet1, facet2, facet3, facet4, link
    RETURN elem

With or without DISTINCT, it takes ages and sometimes simply crashes. With only two keywords it works well (< 100 ms); with three it is slow, and with four it crashes (more or less). I need to find a way to do this without any external service (Solr is not an option here, for upgrade reasons).

Given the picture I attached, what I want is to find a documentaryUnit like #1, attached to keywords 1, 4, 5 and 3 through a link. I also tried with a collection, like so:

    START doc = node:entities("__ISA__:documentaryUnit")
    MATCH (doc)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->(accessPoints)
    WITH collect(accessPoints.__ID__) AS accessPointsId, doc
    WHERE ALL (x IN ['keyword-104', 'place-1', 'keyword-2']
               WHERE x IN accessPointsId)
    RETURN doc.__ID__

This does not crash, but it uses a lot of base nodes as starting points and takes between 1000 ms and 2000 ms.
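
The `collect` / `ALL` filter in the query above boils down to a set-containment test per document. A minimal Python sketch of that logic follows. The `links` mapping, the `doc-…` IDs and the `docs_with_all` function are made up for illustration; none of them come from the real database or from any Neo4j API.

```python
# Hypothetical sample data: document ID -> IDs of the access points
# reachable through a link node (invented for illustration only).
links = {
    "doc-1": {"keyword-104", "place-1", "keyword-2", "keyword-258"},
    "doc-2": {"keyword-104", "place-1"},
    "doc-3": {"keyword-2"},
}


def docs_with_all(links, wanted):
    """Return the sorted IDs of documents linked to every ID in `wanted`."""
    wanted = set(wanted)
    # Same idea as: WHERE ALL (x IN wanted WHERE x IN accessPointsId)
    return sorted(doc for doc, points in links.items() if wanted <= points)


print(docs_with_all(links, ["keyword-104", "place-1", "keyword-2"]))
```

The sketch only shows what the filter has to decide for each document. It says nothing about how Neo4j reaches the candidate documents, which is where the time goes.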

Thank you for reading this. I will reply as soon as possible when you post something.


Solution

  • Two solutions. The best one (around 500 ms on the first run, while the cache warms up, and 270 ms afterwards):

    START 
        accessPoints = node:entities("__ID__:kw-1 OR __ID__:kw-2 OR __ID__:kw-3 OR __ID__:kw-4")
    MATCH 
        (doc)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->accessPoints
    WHERE doc.__ISA__ = "documentaryUnit"
    WITH collect(accessPoints.__ID__) AS accessPointsId, doc
    WHERE ALL (x IN ['kw-1', 'kw-2', 'kw-3', 'kw-4']
               WHERE x IN accessPointsId)
    RETURN doc
    

    The second one takes 5000 ms on the first run and 400 ms afterwards:

    START 
        facet1 = node:entities("__ID__:kw-1"),
        facet2 = node:entities("__ID__:kw-2"),
        facet3 = node:entities("__ID__:kw-3"),
        facet4 = node:entities("__ID__:kw-4")
    MATCH
        (elem)<-[:hasLinkTarget]-()-[:hasLinkTarget]->facet1,
        (elem)<-[:hasLinkTarget]-()-[:hasLinkTarget]->facet2,
        (elem)<-[:hasLinkTarget]-()-[:hasLinkTarget]->facet3,
        (elem)<-[:hasLinkTarget]-()-[:hasLinkTarget]->facet4
    WHERE elem.__ISA__ = "documentaryUnit"
    RETURN elem
    

    Removing the parentheses gave me a much faster response.
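
The first query lists the keyword IDs twice, once in the index lookup and once in the `ALL` filter, so both parts could be generated from a single list. A rough Python sketch, under these assumptions: `build_faceted_query` is a hypothetical helper name, it only does string formatting, it does not escape Lucene or Cypher special characters, and its output has not been run against a server.

```python
def build_faceted_query(ids):
    """Build the text of the first (collect-based) query for the given IDs."""
    if not ids:
        raise ValueError("at least one ID is required")
    # Lucene index lookup: __ID__:kw-1 OR __ID__:kw-2 ...
    lookup = " OR ".join("__ID__:%s" % i for i in ids)
    # Cypher list literal contents: 'kw-1', 'kw-2', ...
    literal = ", ".join("'%s'" % i for i in ids)
    return (
        'START accessPoints = node:entities("%s")\n'
        "MATCH (doc)<-[:hasLinkTarget]-(link)-[:hasLinkTarget]->accessPoints\n"
        'WHERE doc.__ISA__ = "documentaryUnit"\n'
        "WITH collect(accessPoints.__ID__) AS accessPointsId, doc\n"
        "WHERE ALL (x IN [%s] WHERE x IN accessPointsId)\n"
        "RETURN doc"
    ) % (lookup, literal)


print(build_faceted_query(["kw-1", "kw-2", "kw-3", "kw-4"]))
```

Because the IDs are pasted straight into the query text, this is only safe for trusted IDs. Anything coming from user input would need escaping or query parameters instead.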