Search code examples
algorithmhashtable

Hash Table and Substring Matching


I have hundreds of keys for example like:

  • redapple
  • maninred
  • foraman
  • blueapple

i have data related to these keys, data is a string and has related key at the end.

  • redapple: the-tree-has-redapple
  • maninred: she-saw-the-maninred
  • foraman: they-bought-the-present-foraman
  • blueapple: it-was-surprising-but-it-was-a-blueapple

i am expected to use hash table and hash function to record the data according to keys and i am expected to be able to retieve data from table.

i know to use hash function and hash table, there is no problem here.

But;

i am expected to give the program a string which takes place as a substring and retrieve the data for the matching keys.

For example:

i must give "red" and must be able to get

  • redapple: the-tree-has-redapple
  • maninred: she-saw-the-maninred

as output.

or

i must give "apple" and must be able to get

  • redapple: the-tree-has-redapple
  • blueapple: it-was-surprising-but-it-was-a-blueapple

as output.

i only can think to search all keys if they has a matching substring, is there some other solution? If i search all the key strings for every query, use of hashing is unneeded, meaningless, is it?

But, searching all keys for substring is O(N), i am expected to solve the problem with O(1).

With hashing i can hash a key e.g. "redapple" to e.g. 943, and "maninred" to e.g. 332.

And query man give the string "red" how can i found out from 943 and 332 that the keys has "red" substring? It is out of my cs thinking skills.

Thanks for any advise, idea.


Solution

  • Possible you should use the invert index for n-gramm, the same approach is used for spell correction. For word redapple you will have following set of 3-gramms red, eda, dap, app, ppl, ple. For each n-gramm you will have a list of string in which contains it. For example for red it will be

    red -> maninred, redapple

    words in this list must be ordered. When you want to find the all string that contains a a give substring, you dived the substring on n-gramm and intercept the list of words for n-gramm.

    This alogriphm is not O(n), but it practice it has enough speed.