How to (Efficiently) Find Subpaths in Recursive Trees through Cypher (Graph vs Regex)

I'm trying to reproduce a Go SGF game tree in Neo4j through Cypher so I can pattern-search through, as a graph. This is the shape of a Go game tree — based on Sabaki's SGF Parser —:

type GameTree = {
  id: number;
  data: { [key: string]: string[] };
  parentId: number | null;
  children: GameTree[];

You can find a reproducible sample of the problem in this commit.

So far, I've successfully managed to model it just like it appears in a Go editor's game tree:

Game Tree on Neo4j Desktop

B and W mean Black and White moves. AB and AW are "added black or white stones", for when editing the board position. For this question's purposes, only move matters actually, which has the same value as either B or W. Board coordinates are given in 2 letters, e.g. column m and row r gives out the string 'rm'.

Now, what I would like to do is return the full path (probably the root node would suffice actually) from the root if a specified subpath (no skipping) is found. So far, I have this, which does work:

MATCH p=(g:GameNode)-[:NEXT_MOVE*]->()

WITH g, p,
     [m in TAIL(NODES(p)) | m.move] AS moves

WITH g, p,
     REDUCE(path = '', move in moves | path + move) AS joined_moves

WHERE joined_moves CONTAINS "rmro"

RETURN g, p, joined_moves

Following @cybersam's answer, I think this could make it more explicit to Neo4j so it uses an index on m.move — e.g. CREATE INDEX move_node_move_idx FOR (m:MoveNode) ON (m.move) —:

MATCH p=(g:GameNode)-[:NEXT_MOVE*]->(m1:MoveNode)-[:NEXT_MOVE*]->()

WHERE m1.move = HEAD(['rm', 'ro'])

WITH g, p,
     [m in TAIL(NODES(p)) | m.move] AS moves

WITH g, p,
     REDUCE(path = '', move in moves | path + move) AS joined_moves

WHERE joined_moves CONTAINS "rmro"

RETURN g, p, joined_moves

But I really don't know if this type of pattern-search would be efficient enough through Neo4j or other graph databases. I think it could be since, even though most games have 200+ move branches, there aren't that many of them usually. And the analysis would be on unidirectional branches where order matters (no skipping), so it should probably limit the search space even more.

Alternatively, I'm considering placing a string with the full path to it in each move node. This way I could use Regexes (I know Regexes can also be modeled as graphs) instead of path searches. That said, if it works, then I think using a SQL database would be probably enough. However, I would probably stick to Neo4j still, cause modeling this as a graph could provide more useful features in the future.



  • The query below should be fairly performant in returning the moves in every game that contains the desired sequence of moves. It assumes you:

    • Have an index on MoveNode.move,
    • Pass the list of desired moves in the $moves parameter,
    • Adjust the length of the variable-length relationship in the first MATCH to be 1 less than the length of $moves. For example, if $moves has 2 items, use *1 instead of *2. This is admittedly ugly, but Cypher does not support dynamic bounds.


    MATCH p1=(m1:MoveNode)-[:NEXT_MOVE*2]->(m2)
    WHERE m1.move = HEAD($moves) AND [m IN TAIL(NODES(p1)) | m.move] = TAIL($moves)
    MATCH p2 = (m2)-[:NEXT_MOVE*0..]->(end) WHERE NOT (end)-[:NEXT_MOVE]->()
    MATCH p3 = (g:GameNode)-[:NEXT_MOVE*]->(m1)
    RETURN [a IN NODES(p3)[1..-1] | a.move] + $moves + [b IN NODES(p2)[1..] | b.move]

    The index is used to limit and anchor the search by quickly finding the m1 nodes (the MoveNodes matching the first item in $moves), instead of wading through all paths from every GameNode. Then the query:

    • Filters m1 by requiring that it is followed by a subpath satisfying the rest of the $moves list,
    • Finds the sequence of moves that follow each subpath,
    • Works backwards from the surviving m1 nodes to find the corresponding GameNodes.
    • Constructs and returns the sequence of moves for each matching game.

    So, this query starts off by using the index to find the first set of relevant nodes, and keeps adding relationships (and end nodes) to what was already found until it finally has the desired results. There is never a need to scan through irrelevant data.