Search code examples
c#algorithmfull-text-searchcontainscontainskey

Appropriate datastructure for key.contains(x) Map/Dictionary


I am somewhat struggling with the terminology and complexity of my explanations here, feel free to edit it.

I have 1.000 - 20.000 objects. Each one can contain several name words (first, second, middle, last, title...) and normalized numbers(home, business...), email adresses or even physical adresses and spouse names.

I want to implement a search that enables users to freely combine word parts and number parts.
When I search for "LL 676" I want to find all objects that contain any String with "LL" AND "676".
Currently I am iterating over every object and every objects property, split the searchString on " " and do a stringInstance.Contains(searchword).

This is too slow, so I am looking for a better solution.

What is the appropriate language agnostic data structure for this?
In my case I need it for C#.

Is the following data structure a good solution?


It's based on a HashMap/Dictionary.
At first I create a String that contains all name parts and phone numbers I want to look through, one example would be: "William Bill Henry Gates III 3. +436760000 billgatesstreet 12":
Then I split on " " and for every word x I create all possible substrings y that fullfill x.contains(y). I put every of those substrings inside the hashmap/dictionary.
On lookup/search I just need to call the search for every searchword and the join the results. Naturally, the lookup speed is blazingly fast (native Hashmap/Dictionary speed).

EDIT: Inserts are very fast as well (insignificant time) now that I use a smarter algorithm to get the substrings.


Solution

  • It's possible I've misunderstood your algorithm or requirement, but this seems like it could be a potential performance improvement:

    foreach (string arg in searchWords)
    {
        if (String.IsNullOrEmpty(arg))
            continue;
        tempList = new List<T>();
    
        if (dictionary.ContainsKey(arg))
            foreach (T obj in dictionary[arg])
            if (list.Contains(obj))
                    tempList.Add(obj);
        list = new List<T>(tempList);
    }
    

    The idea is that you do the first search word separately before this, and only put all the subsequent words into the searchWords list.

    That should allow you to remove your final foreach loop entirely. Results only stay in your list as long as they keep matching every searchWord, rather than initially having to pile everything that matches a single word in then filter them back out at the end.