Search code examples
regexsearchtreeglob

How to use a regular expression (glob) to search a file tree


How do I adapt a search tree to handle limited regular expressions?

Given a file name, I need to find all nodes matching that file name. Nodes may contain usual file name globs (* and ?). Since this is a search tree, speed is of the essence.

I should add that the most important case for speed is the average time to rule out a match. In most cases matching will fail.

If the tree contained the following nodes:

foo, bar, foo*, *bar, foo?bar 
  • Searching for "foo" would return nodes 1 and 3.
  • Searching for "bar" would return nodes 2 and 4.
  • Searching for "fob" would return no nodes.
  • Searching for "fooxbar" would return node 5.
  • Searching for "foobar" would return nodes 3 and 4.

Solution

  • An Aho-Corasick search tree would fit the bill. "Tries" is a very good article about this sort of thing and the Etrie implementation used in Evolution to replace regex searching.

    To do the whole string matching, you can add beginning and ending anchor states. If scanning multi-line data, you could add the newline to the begin and end. You could also remove the part where it adds the cross linking for the partial matching starting a different match. This also allows faster exclusion.

    Another algorithm for checking for membership in a string set is CritBit. This doesn't have Regex, but it is simple and tests complete strings.