algorithm · performance · substring

Fast way to find substrings of a query


I have a practical problem: given a query string, I need to find every string in a dataset that forms the beginning of that query, and I can't find a suitable method for doing it quickly at scale. (Specifically, I am not trying to find strings that themselves begin with the query; it's the other way round.)

So for example I would have a dataset like this:

Does she drink coffee
They speak English
They speak Turkish 
They speak
What colour is the jumper

And if a query such as They speak English at work were given, then I would want to match They speak English and They speak from the dataset; likewise a query of They speak Turkish at home would match They speak Turkish and They speak.

Everything I have used for string indexing previously works the other way round, e.g. a dataset of complete strings where the query is the substring.

In terms of the dataset:

  • Size is in the tens of millions (and growing, it will never reduce)
  • Every string in the set is unique
  • The strings are of variable length between 13 and 2,000 characters
  • 13 is a hard limit, no query can be fewer characters than this

I've considered a simple brute-force check, but it's too slow for my use case. Ideally I need something doable in <10ms on fairly standard hardware.

I've looked at suffix trees, but unless I'm misunderstanding (which I very well might be; I'm more of a web dev than a computer scientist), they wouldn't be suitable for this use case.

The closest thing I've thought of is to trim every string in my dataset to the minimum length (13 characters) and use those prefixes as keys in a hashmap, so a lookup only has to search entries that are guaranteed to share the query's first 13 characters. The problem with that is the buckets won't be at all balanced, as there is some level of duplication at the beginning of strings. Some of my map entries (tested on prod data) would hold >100k strings, which would lead to inconsistent performance, particularly as the set grows.
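
For illustration, here's a minimal sketch of that bucketing idea (Python; the names are mine and purely illustrative):

    from collections import defaultdict

    MIN_LEN = 13  # hard minimum length of every stored string and every query

    buckets = defaultdict(list)

    def add(entry):
        # Key each stored string by its first 13 characters.
        buckets[entry[:MIN_LEN]].append(entry)

    def prefixes_of(query):
        # Only entries sharing the query's first 13 characters can be
        # prefixes of it, so only that one bucket needs to be scanned.
        candidates = buckets.get(query[:MIN_LEN], [])
        return [s for s in candidates if query.startswith(s)]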


Solution

  • A few million strings should fit in RAM, so we can build a data structure that stays entirely in RAM.

    I would create a prefix tree (based on words, not characters) and then traverse the tree with the query for as long as possible. All prefixes found along the way are matches.

        (root)
         +-- What - colour - is - the - jumper*
         +-- Does - she - drink - coffee*
         +-- They - speak* --+-- English*
                             +-- Turkish*
    

    P.S.: * marks the end of a valid entry.

    Finding all matches would be O(n) (where n is the length of the query).
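
    A minimal sketch of that word-level trie in Python (class and method names are my own, purely illustrative):

        class TrieNode:
            def __init__(self):
                self.children = {}     # word -> child TrieNode
                self.is_entry = False  # True if a stored string ends here

        class WordTrie:
            def __init__(self):
                self.root = TrieNode()

            def insert(self, entry):
                node = self.root
                for word in entry.split():
                    node = node.children.setdefault(word, TrieNode())
                node.is_entry = True

            def prefixes_of(self, query):
                # Walk the tree word by word; every flagged node passed on
                # the way down is a stored string that prefixes the query.
                matches = []
                node = self.root
                words = query.split()
                for i, word in enumerate(words):
                    node = node.children.get(word)
                    if node is None:
                        break
                    if node.is_entry:
                        matches.append(" ".join(words[:i + 1]))
                return matches

        trie = WordTrie()
        for s in ["Does she drink coffee", "They speak English",
                  "They speak Turkish", "They speak",
                  "What colour is the jumper"]:
            trie.insert(s)

        print(trie.prefixes_of("They speak English at work"))
        # -> ['They speak', 'They speak English']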