Tags: c++, data-structures, abstract-data-type

Is there a set-like data structure optimized for searches where it is known ahead of time there will be a high percent of matches?


I have a use case where a set of strings will be searched for a particular string, s. The percent of hits or positive matches for these searches will be very high. Let's say 99%+ of the time, s will be in the set.

I'm using boost::unordered_set right now, and even with its very fast hash algorithm, it takes about 600ms on a VM (roughly 40ms on good hardware) to search the set 500,000 times. Yeah, that's pretty good, but unacceptable for what I'm working on.

So, is there any sort of data structure optimized for a high percent of hits? I cannot precompute the hashes for the incoming strings, so I think I'm looking at a cost of O(average string length) per lookup for a hash set like boost::unordered_set. I looked at tries; they would probably perform well in the opposite case, where hits are rare, but not really any better than hash sets here.

edit: some other details about my particular use case:

the number of strings in the set is around 5,000. The longest string is probably no more than 200 chars. Search gets called again and again with the same strings, but they are coming in from an outside system and I cannot predict what the next string will be. The exact match rate is actually 99.975%.

edit2: I did some of my own benchmarking

I collected 5,000 of the strings that occur in the real system. I created two scenarios.

1) I loop over the list of known strings and do a search for them in the container. I do this for 500,000 searches ("hits").

2) I loop through a set of strings known not to be in the container, for 500,000 searches ("misses").

(Note - I'm interested in hashing the data in reverse because, eyeballing my data, I noticed that there are a lot of common prefixes while the suffixes differ - at least that is what it looks like.)
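
For concreteness, a suffix-biased hash of the sort referenced in the table below looks roughly like this (an illustrative sketch only, not the exact functor I benchmarked; the cutoff of 15 characters is just one of the values I tried):

#include <cstddef>
#include <string>
#include <boost/functional/hash.hpp>
#include <boost/unordered_set.hpp>

// Illustrative suffix hash: mix the string length with (up to) its last
// 15 characters, on the theory that the suffixes vary more than the prefixes.
struct SuffixHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t seed = 0;
        boost::hash_combine(seed, s.size());
        std::size_t start = s.size() > 15 ? s.size() - 15 : 0;
        for (std::size_t i = start; i < s.size(); ++i)
            boost::hash_combine(seed, s[i]);
        return seed;
    }
};

typedef boost::unordered_set<std::string, SuffixHash> SuffixSet;

Equality is still a full-string comparison, so hashing only part of the string affects speed and collision rate, not correctness.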

Tests done on a VirtualBox CentOS 5.6 VM running on a MacBook host.

                                                                           hits (ms)  misses (ms)
boost::unordered_set with default hash and no reserved size:                  591.15      441.39
tr1::unordered_set with default hash:                                         191.09      143.80
boost::unordered_set with a reserve size set:                                 579.31      431.54
boost::unordered_set w/custom hash (hash on the last 15 chars + str size):    357.34      812.13
boost::unordered_set w/custom hash (hash on the last 25 chars + str size):    362.60      795.33
trie:                                                                        1809.34       58.11
trie with reversed insertion/search:                                         2806.26      311.14

In my tests, where there are a lot of matches, the tr1 set is the best. Where there are a lot of misses, the Trie wins big.

My test loop looks like this, where function_set is the container being tested (loaded with the 5,000 strings), and functions is a vector of either all the strings in the container or a bunch of strings that are not in the container.

while (searched < kTotalSearches) {
    // Keep sweeping the test strings until kTotalSearches lookups have been done.
    for (std::vector<std::string>::const_iterator i = functions.begin(); i != functions.end(); ++i) {
        function_set.count(*i);
        searched++;
        if (searched == kTotalSearches)
            break;
    }
}
std::cout << searched << " searches." << std::endl;

Solution

  • This question piqued my curiosity, so I ran a few tests to satisfy myself; the results are below. A few general notes:

    • The usual caveats about benchmarking apply (don't trust my numbers, do your own benchmarks with your specific use case and data, etc...).
    • Tests were done using MSVS C++ 2010 (speed optimized, release build).
    • Benchmarks were run using 10 million loops to improve timing accuracy.
    • Names were generated by randomly concatenating 20 different string fragments into strings ranging from 4 to 65 characters in length.
    • Names included only letters and some tests (trie) were case-insensitive for simplicity, though there's no reason the methods can't be extended to include other characters.
    • Tests try to match the 99.975% hit rate given in the question.

    Test Descriptions

    Basic description of the tests run with the relevant details:

    • String Iteration -- Simply iterates through the function name for a baseline time comparison.
    • Map -- std::unordered_map<std::string, int>
    • Set -- std::unordered_set<std::string>
    • BoostSet -- boost::unordered_set<std::string>, v1.47.0
    • CharMap -- std::unordered_map<const char*, int>
    • CharSet -- std::unordered_set<const char*>
    • FastMap -- Simply a std::unordered_map<> using a custom FNV-1a hash algorithm.
    • FastSet -- Simply a std::unordered_set<> using a custom FNV-1a hash algorithm (a sketch of such a hash follows this list).
    • CustomMap -- A basic hash map I wrote myself years ago.
    • Trie -- A standard trie downloaded from Google Code.
    • CustomTrie -- A bare-bones trie I wrote myself.
    • BinarySearch -- Using std::binary_search() on a sorted std::vector<std::string>.
    • SortArrayMap -- An attempt to use a size_t VectorIndex[26][26][26][26][26] array to index into a sorted array.
    • PerfectMap -- A std::unordered_map<> using a perfect hash from gperf.
    • PerfectWordSet -- Using the gperf is_word_set() function directly.
    • PerfectWordSetFunc -- Same as PerfectWordSet but called in a function instead of inline.
    • PerfectWordSetThread -- Same as PerfectWordSet but the work is split into N threads (standard Windows threads). No synchronization is used except for waiting for the threads to finish.
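
    Since FastMap and FastSet use a custom FNV-1a hash, here is a minimal sketch of what such a hash functor looks like when plugged into the standard containers (the constants are the published 64-bit FNV-1a values; the code actually used in the tests differed in its details):

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <unordered_set>

    // Sketch of an FNV-1a string hash for std::unordered_set/map.
    struct Fnv1aHash {
        std::size_t operator()(const std::string& s) const {
            std::uint64_t h = 14695981039346656037ULL;     // FNV-1a offset basis
            for (std::size_t i = 0; i < s.size(); ++i) {
                h ^= static_cast<unsigned char>(s[i]);
                h *= 1099511628211ULL;                      // FNV-1a prime
            }
            return static_cast<std::size_t>(h);
        }
    };

    // FastSet was essentially this shape:
    typedef std::unordered_set<std::string, Fnv1aHash> FastSet;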

    Test Results (Mostly Hits)

    Results sorted from slowest to fastest (for the case of mostly hits, ~99.975%):

    • Trie -- 9100 ms
    • SortArrayMap -- 6600 ms
    • PerfectWordSetFunc -- 4050 ms
    • CustomTrie -- 3470 ms
    • BinarySearch -- 3420 ms
    • CustomMap -- 2700 ms
    • CharSet -- 1300 ms
    • CharMap -- 1300 ms
    • BoostSet -- 1200 ms
    • FastSet -- 970 ms
    • FastMap -- 930 ms
    • Original Poster -- 800 ms (estimated)
    • Set -- 730 ms
    • Map -- 690 ms
    • PerfectMap -- 650 ms
    • PerfectWordSet -- 500 ms
    • PerfectWordSetThread(1) -- 500 ms
    • StringIteration -- 350 ms
    • PerfectWordSetThread(2) -- 260 ms
    • PerfectWordSetThread(4) -- 150 ms
    • PerfectWordSetThread(32) -- 125 ms
    • PerfectWordSetThread(8) -- 120 ms
    • PerfectWordSetThread(16) -- 110 ms

    Test Results (Mostly Misses)

    Results sorted from slowest to fastest (for the case of mostly misses, ~0.1% hits):

    • BinarySearch -- ? (took too long)
    • SortArrayMap -- 8050 ms
    • Trie -- 3200 ms
    • CustomMap -- 1700 ms
    • BoostSet -- 920 ms
    • CustomTrie -- 850 ms
    • FastMap -- 590 ms
    • FastSet -- 580 ms
    • CharSet -- 550 ms
    • CharMap -- 550 ms
    • StringIteration -- 350 ms
    • Set -- 330 ms
    • Map -- 330 ms
    • PerfectMap -- 280 ms
    • PerfectWordSet -- 140 ms
    • PerfectWordSetThread(1) -- 130 ms
    • PerfectWordSetThread(2) -- 75 ms
    • PerfectWordSetThread(4) -- 45 ms
    • PerfectWordSetThread(32) -- 45 ms
    • PerfectWordSetThread(8) -- 40 ms
    • PerfectWordSetThread(16) -- 35 ms

    Discussion

    My first guess was that a trie would be a good fit for this sort of thing, but from the results the opposite actually appears to be true. Thinking about it some more, this makes sense, for much the same reasons you would not use a linked list.

    I assume you may be familiar with the table of latencies that every programmer should know. In your case you have 500k lookups executing in 40 ms, or 80 ns/lookup. At that scale you easily lose if you have to access anything not already in the L1/L2 cache. A trie is really bad for this, as you have an indirect and probably non-local memory access for every character. Given the size of the trie in this case I couldn't figure out any way of getting the entire trie to fit in cache to improve performance (though it may be possible). I still think that even if you did get the trie to fit entirely in L2 cache, you would lose with all the indirection required.
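
    To make the pointer chasing concrete, a bare-bones trie lookup (hypothetical, not the trie from the tests) has roughly this shape; note the dependent load for every character of the key:

    #include <cstddef>
    #include <string>

    // Illustrative trie node: one child pointer per lowercase letter (assumes a-z input).
    struct TrieNode {
        bool      is_word;      // true if the path to this node spells a stored string
        TrieNode* child[26];    // each hop below is a pointer dereference, likely a cache miss
    };

    bool Contains(const TrieNode* root, const std::string& s) {
        const TrieNode* node = root;
        for (std::size_t i = 0; i < s.size(); ++i) {
            node = node->child[s[i] - 'a'];   // one dependent, probably non-local load per character
            if (node == 0)
                return false;
        }
        return node->is_word;
    }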

    The std::unordered_ containers actually do a very good job of things out of the box. In fact, in trying to speed them up I actually made them slower (in the poorly named FastMap and FastSet trials). Same thing with trying to switch from std::string to const char * (about twice as slow).

    The boost::unordered_set<> was twice as slow as the std::unordered_set<> and I don't know if that is because I just used the built-in hash function, was using a slightly old version of boost, or something else. Have you tried std::unordered_set<> yourself?

    By using gperf you can easily create a perfect hash function if your set of strings is known at compile time. You could probably create a perfect hash at runtime as well, depending on how often new strings are added to the map. This gets you a 23% speed increase over the standard map implementation.
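
    The gperf workflow is roughly the following (the file name and keywords are made up for illustration): list the known strings in a keyword file, run gperf over it, and call the generated lookup from your code.

    // functions.gperf -- hypothetical keyword file, one known string per line:
    //     %%
    //     some_known_function
    //     another_known_function
    //     %%
    //
    // Generate the lookup code (default output is C):
    //     gperf functions.gperf > perfect_hash.c
    //
    // The generated file defines hash() and in_word_set(); by default
    // in_word_set(str, len) returns a non-null pointer on a hit.

    #include <cstddef>
    #include <string>

    // Declaration of the gperf-generated function (the exact signature can
    // vary with the gperf options used).
    const char* in_word_set(const char* str, std::size_t len);

    bool IsKnownString(const std::string& s) {
        return in_word_set(s.c_str(), s.size()) != 0;
    }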

    The PerfectWordSetThread tests simply use the perfect hash and split the work up into 1-32 threads. This problem is perfectly parallel (at least the benchmark is), so you get almost a 5x performance boost in the 16-thread case. This works out to only 6.3 ms/500k lookups, or 13 ns/lookup... a mere 50 cycles on a 4 GHz processor.
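
    For the curious, the threaded test has roughly the following shape. The actual benchmark used native Windows threads; this sketch uses std::thread instead, and the names (RunLookups, Inputs) are invented. Each thread walks its own slice of the input, so the only synchronization is the final join:

    #include <cstddef>
    #include <string>
    #include <thread>
    #include <vector>

    const char* in_word_set(const char* str, std::size_t len);   // gperf-generated lookup (see the sketch above)

    // Hypothetical driver: split Inputs across NumThreads threads, each counting
    // its own hits locally so nothing is shared except the read-only data.
    void RunLookups(const std::vector<std::string>& Inputs, unsigned NumThreads) {
        std::vector<std::thread> Threads;
        for (unsigned t = 0; t < NumThreads; ++t) {
            Threads.push_back(std::thread([&Inputs, t, NumThreads]() {
                std::size_t Matches = 0;                          // per-thread counter, nothing shared
                for (std::size_t i = t; i < Inputs.size(); i += NumThreads) {
                    if (in_word_set(Inputs[i].c_str(), Inputs[i].size()))
                        ++Matches;
                }
                (void)Matches;                                    // a real test would record this
            }));
        }
        for (std::size_t t = 0; t < Threads.size(); ++t)          // only synchronization: join at the end
            Threads[t].join();
    }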

    The StringIteration case really points out how difficult it is going to be to get much faster. Just iterating the string being found takes 350 ms, or 70% of the time compared to the 500 ms map case. Even if you could perfectly guess each string you would still need this 350 ms (for 10 million lookups) to actually compare and verify the match.
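
    For reference, the baseline is presumably nothing fancier than a loop of this shape (my reconstruction, not the exact benchmark code):

    #include <cstddef>
    #include <string>

    // Hypothetical StringIteration baseline: touch every character of the
    // candidate string and return something so the compiler cannot drop the loop.
    std::size_t TouchString(const std::string& s) {
        std::size_t sum = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
            sum += static_cast<unsigned char>(s[i]);
        return sum;
    }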

    Edit: Another thing that illustrates how tight things are is the difference between the PerfectWordSetFunc at 4050 ms and PerfectWordSet at 500 ms. The only difference between the two is that one is called in a function and one is called inline. Calling it as a function reduces the speed by a factor of 8. In basic pseudo-code this is just:

    bool IsInPerfectWordSet (string Match)
    {
        return in_word_set(Match);
    }
    
    // Inline benchmark: PerfectWordSet
    for i = 1 to 10,000,000
    {
        if (in_word_set(SomeString)) ++MatchCount;   
    }
    
    // Function call benchmark: PerfectWordSetFunc
    for i = 1 to 10,000,000
    {
        if (IsInPerfectWordSet(SomeString)) ++MatchCount;   
    }
    

    This really highlights the difference in performance that inlining code/functions can make. You also have to be careful about what exactly you are measuring in a benchmark: sometimes you want to include the function call overhead, and sometimes not.

    Can You Get Faster?

    I've learned to never say "no" to this question, but at some point the effort may not be worth it. If you can split the lookups into threads and use a perfect, or near-perfect, hash function you should be able to approach 100 million lookup matches per second (probably more on a machine with multiple physical processors).

    A couple ideas I don't have the knowledge to attempt:

    1. Assembly optimization using SSE
    2. Use the GPU for additional throughput
    3. Change your design so you don't need fast lookups

    Take a moment to consider #3: the fastest code is code that never needs to run. If you can reduce the number of lookups, or reduce the need for extremely high throughput, you won't need to spend time micro-optimizing the ultimate lookup function.