Search code examples
treesuffix-treesuffix-array

Modifying a Generalised Suffix Tree to hold number of times a node appears in the text string


How do I modify the procedure in Ukkonen's paper to hold a value for number of times a word appears in the text. Are there any such implementations available that provide the string frequency as well?

The modification I want is like for a string "hehe" the frequency count for all "h","e","he" should be 2 in the tree. Rest nodes have the default value 1.

I found some libraries like the best so far and some previous questions like this.

But none of them describes a good enough solution to my problem. Also I have to process a very large dictionary file (about a billion words). Then the algorithm needs to be really fast. And I am ready to compromize on space a bit.


Solution

  • The answer can be found here: Counting the number of substrings

    Basically, build the suffix tree, match the substring starting from the root and count the leaf nodes below that point. That's the number of times the word appears in the text.