Search code examples
neo4jcypherbacon-number

Finding actors not connected to Kevin Bacon, efficiently


Using neo4j cypher, what query would efficiently find actors who are not connected to Kevin Bacon? We can say that 'not connected' means that an actor is not connected to Kevin Bacon by at least 10 hops for simplicity.

Here is what I have attempted:

MATCH (kb:Actor {name:'Kevin Bacon'})-[*1..10]-(h:Actor) with h 
MATCH (a)-[:ACTS_IN]->(m) 
WHERE a <> h 
RETURN DISTINCT h.name

However, this query runs for 3 days. How can I do this more efficiently?


Solution

  • (A) Your first MATCH finds every actor that is connected within 10 hops to Kevin Bacon. The result of this clause is a number (M) of rows (and if an actor is connected in, say, 7 different ways to Kevin, then that actor is represented in 7 rows).

    (B) Your second MATCH finds every actor that has acted in a movie. If this MATCH clause were standalone, then it would require N rows, where N is the number of ACTS_IN relationships (and if an actor acted in, say, 9 movies, then that actor would be represented in 9 rows). However, since the clause comes right after another MATCH clause, you get a cartesian product and the actual number of result rows is M*N.

    So, your query requires a lot of storage and performs a (potentially large) number of redundant comparisons, and your results can contain duplicate names. To reduce the storage requirements and the number of actor comparisons (in your WHERE clause): you should cause the results of A and B to have distinct actors, and eliminate the cartesian product.

    The following query should do that. It first collects a single list (in a single row) of every distinct actor that is connected within 10 hops to Kevin Bacon (as hs), and then finds all (distinct) actors not in that collection:

    MATCH (kb:Actor {name:'Kevin Bacon'})-[*..10]-(h:Actor)
    WITH COLLECT(DISTINCT h) AS hs
    MATCH (a:Actor) 
    WHERE NOT a IN hs
    RETURN a.name;
    

    (This query also saves even more time by not bothering to test whether an actor has acted in a movie.)

    The performance would still depend on how long it takes to perform the variable length path search in the first MATCH, however.