How would you implement a solution to the following problem:
Given a string of symbols and patterns, that can be added at any time, count how often each pattern occurs.
Example:
There is a continuous stream of incoming symbols let’s say A B A C D E ...
The user can at any time register a new pattern for example (B A C). If the pattern was registered before the second timestep the pattern should be counted once.
In case of overlapping patterns only the first pattern should be counted e.g (B A C) and (A C D) would result only (B A C) being counted.
Solution approaches:
The trivial solution is to just keep one position per pattern, advance it when the pattern is matched and reset all positions once one is matched. This would lead to a runtime of O(n * m)
where n is the length of the stream and m is the length of the longest pattern (by having the same prefix of length m - 1 in all pattern for example).
The alternative approach would be to construct a finite automata and use the fact that pattern can have combined prefixes.
However there are a few problems with that:
How construct the edges between the patterns? (e.g. B D E from A B)
How to add patterns at runtime. E.g. let’s say the stream is A B and at the moment only the pattern (A B C) is registered. Now the user registers (B A C). If the stream continues with A C D E. The pattern should not be matched since the first symbol occurred before registering it.
The idea could be linked to Aho Corasick algorithm. However the algorithm does match all occurrences of the patterns and not only the first one. It does not allow for patterns to be added at runtime.
Maintain an initially-empty list of Aho-Corasick FSMs. Whenever a new pattern is registered, create a new FSM for just this pattern, append it to the list, and check whether there are now 2 single-string FSMs at the end of the list: if so, delete them, build a single new FSM for both strings, and put this FSM in place of the original 2. Now check whether there are 2 2-string FSMs, and combine them into a single 4-string FSM if so. Repeat this procedure of combining two k-string FSMs into a single (2k)-string FSM until all FSMs are for distinct numbers of strings. (Notice that any 2 FSMs for the same number of strings must be at adjacent positions in the list.)
Suppose n registrations occur in total. As a result of the above "compacting" procedure, the list will contain at most log2(n)+1 FSMs at all times, so the overall "cost factor" of using each of these FSMs to search the input stream (vs. a single FSM containing all strings) is O(log n). Also, the number of FSM-building processes that a particular string participates in is capped at log2(n)+1, since each new FSM that it participates in building is necessarily twice as large as the previous one that it participated in building. So the overall "cost factor" of building all the FSMs (vs. building a single FSM containing all strings) is also O(log n).