algorithm data-structures graph graph-theory

Efficiently find marked referenced records

I have

a few million records in a database that
reference each other (a directed acyclic graph). There are direct references (A -> B) and indirect references (if A -> B and B -> C, then A -> C). Indirect references can have any recursion depths, but in reality the depth is at most 100. This is very similar to objects in an object oriented language can reference other objects, recursively, except that cycles are not allowed.
A record can have between zero and 100 direct references.
Each record can be marked or not (most records are not marked).

Problem

I'm looking for an efficient data structure and algorithm to find all marked referenced (directly or indirectly referenced) records given a set of records (often just one, or up to 100). There are directly marked records (if a directly referenced record is marked), or indirectly marked records (if an indirectly referenced record is marked).

Reading the records is relatively slow, let's say 2 milliseconds per record.

I'm not looking for using a faster storage or similar here. I know it is possible, but it is quite hard to keep in sync. I'm trying to add a secondary data structure that contains just the relevant data. This will speed things up quite a bit (maybe factor of 10 or even 100), but bring a constant-factor improvement. I'm still interested in understanding if it's possible to improve the algorithm, if the amount of data grows.

Ideas

I have thought about the following options:

Brute force: One algorithm would be to search for all (directly or indirectly referenced) entries, and filter for marked entries. But that is slow, obviously, as I have to process all (directly or indirectly) referenced entries. Maybe none are marked, but 20'000 are referenced.
Shadow mark: Another algorithm would be to have a reverse index (which entries are referencing which other entries), and then each time an entry is marked, also "shadow-mark" all the entries that reference this entry, recursively. That way, when searching for marked entries, we can filter for those that have the "shadow-mark" set. The disadvantage is that many updates are needed if an entry is marked. A related option would be using a Bloom filter for shadow marking. But this would just reduce the memory usage.
Let's say we maintain a "maximum-depth" which is the maximum depth of a tree (the maximum number of hops from any record). And then we use the shadown-mark algorithm from above, but only partially: only up to maximum-depth / 2 recursion levels. So we limit propagating the shadow-mark. And then, for a query, we also limit the recursion depth to maximum-depth / 2. That way, we will "meet in the middle" in the worst case. (I should probably draw a picture.) A sub-problem is then how to efficiently maintain this maximum-depth.

I wonder, is there something similar to this approach? Something that doesn't require many updates when marking an entry, and doesn't require too many reads when querying? Or maybe a solution that allows to gradually update entries, if an entry is marked?

Example

In this example (blue is "marked"), for example if I search for (indirectly) referenced marked records for 5, I would like to quickly find 1 and 3.

Solution

You could keep a table on each node that records which marked nodes are reachable from it, and keep it updated whenever a node (or edge) is added or removed from the graph, similar to network routing tables are kept for each node in a network. There are a couple of specifics about your problem that make it simpler than a network routing table though:

You don't want to know the actual path to the marked nodes from a given node, only that one (or more) exists.
The graph is acyclic.
It's not a distributed system so you have full control (obviously ...).

Because you don't care about path and because the graph is acyclic, the table on each node can be a map marked_node_id -> count where count is the number of paths from the given node to the given marked-node. When a new node is added the new node's table is built as the union of all the nodes tables adjacent to the new node where count is the sum. Additionally, the tables of all nodes adjacent from the new node have to be updated by adding the new node's table to each of them, and this has to be done recursively up the adjacent from chain. When a node is deleted you have to do similar.

Basic complexity analysis: Finding all marked nodes of a given node is O(1) and can be done with info stashed on a single node - which is the whole point. In general, adding and removing an edge (or a new node plus its edges) will require updating tables of all connected nodes recursively (upto a call depth of 100 and branching factor upto 100). Building tables initially would be O(number-of-nodes) by reverse flooding from marked nodes.

Code Example:

This is abstract and in-code solution but should translate. I'm using Python (+GraphViz) because you didn't specify a language, it's probably most accessible to widest audience, and is easy to prototype in. I'm also going to only implement add/remove node operations (to modify a node can remove then add with different initialization) and build the graph from scratch which isn't really realistic, but you can build tables initially given an existing graph by working backwards from marked nodes pretty easily. Also note:

The following require each node to have/maintain an adjacent_from list in addition to adjacent_to list so we can recurse up the adjacent from paths when a given node is deleted.
I've assumed each marked node is reachable from itself - just makes things bit easier to implement.


def main():
  ''' Build a test graph, then test. '''
  graph = Graph()
  a = graph.add_node('a', marked=True)
  b = graph.add_node('b', marked=True)
  c = graph.add_node('c', marked=True)
  d = graph.add_node('d', adjacent_to=[a])
  e = graph.add_node('e', adjacent_to=[d])
  f = graph.add_node('f',adjacent_to=[c])
  g = graph.add_node('g', adjacent_to=[d,f])
  h = graph.add_node('h', adjacent_to=[e,g])
  i = graph.add_node('i')
  j = graph.add_node('j', marked=True, adjacent_to=[i])
  k = graph.add_node('k', adjacent_to=[j])
  l = graph.add_node('l', adjacent_to=[k])
  m = graph.add_node('m', adjacent_to=[j])
  with open('main0.dot', 'w') as f:
    f.write(graph.gviz())
  graph.delete_node('f')
  with open('main1.dot', 'w') as f:
    f.write(graph.gviz())
  graph.delete_node('e')
  with open('main2.dot', 'w') as f:
    f.write(graph.gviz())
  graph.delete_node('g')
  with open('main3.dot', 'w') as f:
    f.write(graph.gviz())
  # Run this script to process graphviz files: for i in *.dot; do dot -Tpng $i > "${i%%.dot}.png"; done

class Graph:
  ''' Container for nodes. '''
  def __init__(self):
    self.nodes = {}

  def add_node(self, id, marked=False, adjacent_to=[]):
    assert id not in self.nodes
    self.nodes[id] = Node(id, marked, adjacent_to)
    return self.nodes[id]

  def delete_node(self, id):
    assert id in self.nodes
    node = self.nodes[id]
    self._recursive_subtract_table_on_delete(node, node)
    for adjacent_from_node in node.adjacent_from:
      adjacent_from_node._remove_adjacent_node(node.id)
    del self.nodes[id]

  def _recursive_subtract_table_on_delete(self, node, deleted_node):
    for adjacent_from_node in node.adjacent_from:
      self._recursive_subtract_table_on_delete(adjacent_from_node, deleted_node)
    node._delete_reachability_table(deleted_node)

  def gviz(self):
    return 'strict digraph {\n%s}' % ''.join([n._gviz_edges() for n in self.nodes.values()])

class Node:
  def __init__(self, id, marked=False, adjacent_to = []):
    ''' Init node. Note only adjacent_to not adjacent_from node are allowed,
    which measn we dno't have to update adjacent_from reachable_marks.  '''
    self.id = id
    self.marked = marked
    self.adjacent_to = adjacent_to
    self.adjacent_from = []
    self.reachable_marks = {}

    if marked:
      self.reachable_marks[id] = 1
    for adjacent_node in adjacent_to:
      adjacent_node.adjacent_from.append(self);
      self._add_reachability_table(adjacent_node)

  def _add_reachability_table(self, node):
    ''' Add the reachable_marks table from node to self. '''
    for (marked_node_id, k) in node.reachable_marks.items():
      self.reachable_marks[marked_node_id] = self.reachable_marks[marked_node_id] + 1 if marked_node_id in self.reachable_marks else 1

  def _delete_reachability_table(self, node):
    ''' Delete the reachable_marks table from node from self. '''
    for (marked_node_id, k) in node.reachable_marks.items():
      self.reachable_marks[marked_node_id] = self.reachable_marks[marked_node_id] - 1 if marked_node_id in self.reachable_marks else 0
    self.reachable_marks = {k: v for k,v in self.reachable_marks.items() if v}

  def _remove_adjacent_node(self, id):
    self.adjacent_to = list(filter(lambda n: n.id != id, self.adjacent_to))

  def _gviz_edges(self):
    ''' Helper to print graphviz edges adjacent to this node. '''
    _str = ''
    if self.marked:
      _str += ' %s[style=filled,fillcolor=blue]\n' % (self._gviz_name(),)
    else:
      _str +=  self._gviz_name() + '\n'
    for adjacent_node in self.adjacent_to:
      _str += ' %s -> %s\n' % (self._gviz_name(), adjacent_node._gviz_name())
    return _str;

  def _gviz_name(self):
    ''' Helper to print graphviz name with reachable marks. '''
    return '"' + self.id + '(' + ','.join(self.reachable_marks.keys()) + ')"'

if __name__ == '__main__':
  main()

Results:

The output graph shows marked nodes reachable from each node in brackets.

Initial:

Remove node f:

Remove node e:

Remove node g: