Search code examples
c#dictionarystring-matching

Finding the longest Dictionary.Key match in a phrase


I have a SortedDictionary<string, string>, ordered by key length descending, of the form:

  • red fox - address1
  • weasel - address2
  • foxes - address3
  • fox - address3 etc.
    and a list of phrases e.g.
    "today the red fox is home"
    "no foxes today"
    etc

I need to find the value of the longest key in the dictionary that matches a substring in a given phrase. Given that the dictionary has about one thousand entries, is there a better approach than a brute force search?

The examples above should return:
"red fox" - address1
"foxes" - address3
I have the brute force solution of iterating over the Keys, but in view of the number of keys and the need to perform the search in tens of phrases, I'm looking for a cleverer approach. -1

I did some performance measuring.
With a dictionary of 100 keys, searching for matches in 100 phrases took on the average 65 ms.
If the execution time scales linearly with the dictionary length and with the number of phrases, then the whole process will take about 20 seconds (search time only).
Taking into account the time needed to read the phrases from files and to make use of the matches, it will take probably 3-4 minutes for the whole run.
The user will have to look at some animation for this time, or she could drink a glass of something.
Would this be acceptable?
If yes, the brute force approach is OK.
If not, I am still open to suggestions.


Solution

  • I solved it with a different approach:
    Given:

    • a collection of F files, each containing P phrases of W words each; typically F is 300, P=10 on the average and W=5 on the average.
    • a Dictionary containing N entries; N=1000

    For each phrase in the files, find the longest 1- to 3-word sub-phrase matching one of the dictionary keys.

    Solution:
    For each phrase p in F, create a List containing all 1-, 2- and 3-word sub-phrases, ordered by length descending; there will be 3W-3 sub-phrases.
    For each sub-phrase in the List, check if it matches a key in the Dictionary.

    (In contrast, the brute force approach, for each phrase p in each file, iterates over the dictionary keys, using p.Contains(key) to find a match.)

    Performance: The time spent is linearly proportional to F, P and W and constant for the dictionary lookup.
    This is an improvement over the brute force approach, which is proportional to the dictionary size as well. Thus, the performance gain goes up with the dictionary size. Additionally, the brute force approach uses string.Contains() which is time consuming as well. To sum up, the brute force solution is O(N) while the sub-phrase solution is O(1).
    I think there is no need to post any code, which is just a simple double loop which builds the sub-phrases list.