Search code examples
javasearchpattern-matchingtrieaho-corasick

Finding Patterns Occurrences In A Large Text File (Currently With Aho-Corasick)


I have a large text (5MB-500MB) file and a set of a few thousand patterns. For each pattern, I want to get the number of occurrences of the pattern in the file. The text contains no whitespaces and is a basic long alpha-numeric string.

For that purpose, I was trying to use the Aho-Corasick algorithm, specifically Robert-Bor's Java implementation, and it indeed works fast enough, but there is a problem: the result of counting the emits with the pattern as their string is not equal to the result of opening the text file with a text editor such as notepad++ and counting the pattern. It is important to me that the number of occurrences counted will be exactly the number of times the pattern found in the file. Therefore, I need to find a solution to this problem.

Is there a change I can make in the algorithm's implementation in order to fulfill my goal? Maybe an EmitHandler of some kind will solve my problem? I am also open to other suggestions such as replacing the algorithm/solution method. Yet, I want to stay with java if possible, and to get the results as fast as possible (the emits indices are not important to me, for example).


Edit: For example, even the small following text of an installation file: File Link, and the pattern:

5b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff55b4e5ff

which according to the emits count appears 150 times in the file, but only appears 10 times according to the count feature of Notepad++/Ctrl-f in a browser.

And another example on the same text:

f34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0ff34a6e0

appears 99 times according to the emits count, but only 10 times according to the count of a text editor.

Link to the implementation of the algorithm, here. What I currently run based on the implementation:

  Trie trie = Trie.builder().addKeywords(wordsMap.keySet())
                        .build();
    Collection<Emit> ls2 = trie.parseText(str);``
            for (Emit e: ls2) {
                if (!map.containsKey(e.getKeyword()))
                      map.put(e.getKeyword(),1);
                else {
                    int val = map.get(e.getKeyword());
                    map.replace(e.getKeyword(),val+1);
                }
            }
            return map;

Thanks!

I have also tried the non-overlapping option available with the implementation, but it doesn't fit the requirements and also too slow for my uses.


Solution

  • First of all, it's not clear how or why the algorithm doesn't suit your needs with regard to correctness when the Trie is built with ignoreOverlaps(). I take your word on this, though. I'm also ready to believe you when you say that there's an impact on performance in this case.

    So, instead of digging into the implementation of the algorithm, I'd better use it with overlaps, and then remove the overlaps manually. In this case, I think you will be able to fine-tune which emits to skip.

    Here's the code to initialize the Trie:

    String text = ... // read the text somewhere
    
    Set<String> keywords = new HashSet<>();
    keywords.add("keyword1");
    keywords.add("keyword2");
    
    Trie trie = Trie.builder().addKeywords(keywords).build(); // with overlaps!
    

    Now, let's parse the text:

    Collection<Emit> parseResults = trie.parseText(text);
    

    As far as I can tell, the parse results are returned in order of appearance within the text, but I haven't tested this thoroughly. For the code below to work properly, we need the emits to be sorted by start index.

    Given the emits are sorted by start index, here's the code to count non-overlapping emits per keyword:

    Map<String, Long> map = parseResults.stream()
        .collect(Collectors.groupingBy(Emit::getKeyword, countingNonOverlaps()));
    

    Where the countingNonOverlaps() utility method is as follows:

    private static Collector<Emit, ?, Long> countingNonOverlaps() {
    
        class Acc {
            Emit last;
            long count = 0;
    
            void add(Emit e) {
                if (last == null || !last.overlapsWith(e)) { count++; last = e; }
            }
    
            Acc merge(Acc another) {
                throw new UnsupportedOperationException("Parallel not supported");
            }
        }
        return Collector.of(Acc::new, Acc::add, Acc::merge, acc -> acc.count);
    }
    

    This approach uses a custom collector to count non-overlapping emits per keyword. There are other, simpler ways to do this without a custom collector, but they require to keep a list of the non-overlapping emits per keyword. As you only need the counts and as you're working with 2000 keywords and a huge text, I think this way is better.

    The collector basically keeps track of the last non-overlapping emit collected and increments a count for the current emit being collected only if it doesn't overlap with the last non-overlapping emit. Also, it only works for sequential streams.

    Note: if you need to fine-tune when the counter is incremented, you can customize the add method of the Acc local class.