Streamlining Neo4j query that conditionally creates new relationships

I have a graph database with 3 types of nodes and two relationships:


I want to create a new relationship between the nodes labeled (:PERSON) such as:




subject to <>

So that I can represent competition for scarce labor in a variety of markets represented by (s:SKILLS).

The condition to establish the new relationship [:competes_with] is that 2 distinct (:PERSON) nodes manage companies that seek at least 3 of the same (:SKILLS) profiles.

Orders of magnitude are:

|(:PERSON)|  =  6000
|(:COMPANY)| = 15000
|(:SKILLS)|  = 95000

In my plodding way, what I did was:

MATCH (p1:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1, collect(DISTINCT s.skill_names) AS p1_skills
MATCH (p2:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1,p1_skills, p2, collect(DISTINCT s.skill_names) AS p2_skills
WHERE p1 <> p2
UNWIND p1_skills AS sought_skills
WITH p1,p2, sought_skills, reduce(com_skills=[], sought_skills IN p2_skills | com_skills + sought_skills) AS NCS
WHERE size(NCS) >= 3

Given the size of the problem, this causes a 14GB RAM box to crash after a while with an "out-of-memory" exception.

So, besides the fact that I don't know whether my query actually does what I want (it crashes before completing), the question is:

Can I streamline this to make it work with smaller memory requirements? What would the improved query be like?



    1. The standard Neo4j naming convention is camel-case label names that start with an upper-case character, all-upper-case relationship names, and property names that start with a lower-case character. In this answer, I will follow the standard and use names like Person and MANAGES.
    2. You don't need 2 COMPETES_WITH relationships between the same 2 Person nodes if the relationship is inherently bidirectional. Neo4j can navigate incoming and outgoing relationships equally easily, and the MATCH clause allows a relationship pattern to not specify a direction (e.g., MATCH (a)-[:FOO]-(b)). Also, the MERGE clause (but not CREATE) allows you to specify an undirected relationship -- which ensures that only one relationship exists between the 2 endpoints.
    3. It seems that the COMPETES_WITH relationship really belongs between Company nodes, since that is really the source of the competition. Also, if a Person left a company, you should not have to remove any COMPETES_WITH relationships from that node (and you should also not have to add a COMPETES_WITH relationship to the replacement Person).
    4. In addition, you should consider whether the COMPETES_WITH relationship is really needed in the first place. Every time the skills sought by a Company change, you'd have to recalculate its COMPETES_WITH relationships. You should determine whether doing that is worth it, or whether your queries should just dynamically determine a company's competitors as needed.
    5. Here is a simplified version of your original query:

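      MATCH (p1:Person)-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
      WHERE p1 <> p2  // someone who manages several companies must not compete with themselves
      WITH p1, p2, COUNT(DISTINCT s) AS num_skills  // count each shared skill once per pair
      WHERE num_skills >= 3
      MERGE (p1)-[:COMPETES_WITH]-(p2);  // undirected MERGE keeps a single relationship per pair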

      To find the Person nodes that compete with a given Person:

      MATCH (p1:Person {id: 123})-[:COMPETES_WITH]-(p2:Person)
      RETURN p1, COLLECT(p2) AS competing_people;
    6. If you changed the data model to have the COMPETES_WITH relationship between Company nodes:

      MATCH (c1:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(c2:Company)
      WITH c1, c2, COUNT(s) AS num_skills
      WHERE num_skills >= 3
      MERGE (c1)-[:COMPETES_WITH]-(c2);  // undirected MERGE keeps a single relationship per pair

      With this model, to find the Person nodes that compete with a given Person:

      MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:COMPETES_WITH]-(:Company)<-[:MANAGES]-(p2:Person)
      RETURN p1, COLLECT(p2) AS competing_people;
    7. If you did not have COMPETES_WITH relationships at all, to find the Person nodes that compete with a given Person:

      MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
      WHERE p1 <> p2  // exclude the person themselves if they manage several companies
      WITH p1, p2, COUNT(DISTINCT s) AS num_skills
      WHERE num_skills >= 3
      RETURN p1, COLLECT(p2) AS competing_people;
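
On the memory question specifically: if the single-pattern query from point 5 still exhausts the heap when run as one transaction, one option is to process the Person nodes in batches. The following is only a sketch, not something tested against your data. It assumes Neo4j 4.4 or later (where `CALL { ... } IN TRANSACTIONS` is available) and that the query is run as an auto-commit transaction (for example, prefixed with `:auto` in Neo4j Browser). The batch size of 100 is an arbitrary starting point, and the `id(p1) < id(p2)` filter is an addition here so that each pair is evaluated only once.

```cypher
// Batch over the "left-hand" Person nodes; each batch commits separately,
// so the transaction state never has to hold every new relationship at once.
MATCH (p1:Person)
CALL {
  WITH p1
  MATCH (p1)-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
  // Keep only one ordering of each pair; this also rules out p1 = p2.
  WHERE id(p1) < id(p2)
  WITH p1, p2, count(DISTINCT s) AS num_skills
  WHERE num_skills >= 3
  // Undirected MERGE: creates the relationship only if none exists in either direction.
  MERGE (p1)-[:COMPETES_WITH]-(p2)
} IN TRANSACTIONS OF 100 ROWS;
```

Because each Person is expanded separately, the aggregation only ever covers the competitors of the Person nodes in the current batch rather than all pairs at once. Prefixing the query with PROFILE on a small subset first should show whether the expansion through Skills is where the rows blow up.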