I have a practical problem to do with finding any strings that form the beginning of a given query string and I can't find a suitable method for doing it quickly at scale. (specifically not trying to find strings that themselves begin with the query, it's the other way round).
So for example I would have a dataset like this:
Does she drink coffee
They speak English
They speak Turkish
They speak
What colour is the jumper
And if a query was given such as They speak English at work
then I would want to match They speak English
and They speak
from the dataset, and a query of They speak Turkish at home
would want to match They speak Turkish
and They speak
.
Everything I have used for string indexing previously works the other way round. E.g. a dataset of complete strings and the query being the sub-string.
In terms of the dataset:
I've considered a simple brute force check but it's too slow for my use case. I ideally need something that is doable in <10ms on fairly standard hardware.
I've looked suffix trees but unless I'm misunderstanding (which I very much might be I'm more a web dev than a computer scientist) they wouldn't be suitable for this use case.
The closest thing I've thought of is to trim all of my dataset to the smallest length and put them in a hashmap, then use this as a way to limit the amount of searches done to only entries that are guaranteed to have the same initial 13 characters. The problem with that is it won't be at all balanced as there is some level of duplication at the beginning of strings. Some of my map entries (tested on prod data) would have >100k entries and would lead to inconsistent performance particularly as the set grows.
A view millions should fit in RAM so we can create a data structure that stays in RAM.
I would create a prefix-tree (based on words, not characters) and than traverse the tree with the query as long as possible. All prefixes along the way are matches.
What colour is the jumper
/
+ Does - she - drink - coffee*
\
\ English*
They - speak*<
Turkish*
P.s. * marks end of an valid entry
Finding all matches would be O(n)
(n is length of request)