Search code examples
c#data-structuresdictionarypathtrie

Efficient data structure for associating data with filesystem paths?


I need to keep some data in memory about a possibly large number of files and directories (typically up to a few hundreds of thousands). The obvious approach is to use a Dictionary<string, Something> with the path as the key, but there are two problems with this:

  • many of the files will have a large part of their path in common, so it's probably a waste of memory to store the full path of each file
  • I need to be able to quickly access the data about all the descendants of a directory; with a dictionary, the only way to do this is to test each key and check if it starts with the specified path, which is quite inefficient

This problem seems like a good candidate for using a prefix tree (or trie), with segments of the path as "characters". I tried to implement it, and the performance is not too bad for lookup by prefix (about 4 times faster than a dictionary), but it has two issues:

  • the memory consumption is not reduced, probably because of the overhead of the list of children for each node
  • the construction time is much worse than with a dictionary (about 4 times slower to fill the collection)

I'm sure it must be a very common problem, so perhaps there are well known solutions that I'm not aware of?


Solution

  • Just a few generic ideas:

    First, a Patricia trie is probably the best-known approach for improving the memory consumption of tries - it compacts paths in which all nodes have a single child into one node, and concatenate the characters along the path. There is also a version in which you look at the data as a sequence of binary digits, which has the advantage that you always have at most 2 child nodes, and it's also easier to implement.

    Second, the memory consumption really depends on how you store the children of a given node - do you maintain an array of 256 nodes? That's usually the most efficient way for direct lookup, but also consumes the most memory and is slow if you need to iterate through all the children. Other options are:

    • Store an array of pairs (letter, child node) - This is probably the most memory efficient, since it only stores the objects you actually care about, and it also has a good performance for iterating through all children. However, you have to check all pairs for direct lookup - which is usually fine farther from the root, but could be a problem near the root.

    • Store some kind of dictionary inside each node that maps a letter to a child node. That's the most balanced with respect to performance - it gives you reasonably good speeds for all operations and is somewhat memory efficient.

    Also, if you construct the whole collection up-front and then just query it, there is an approach for storing the child links based on Tarjan tables that will probably increase the construction time, but will save memory and query time later.