I have a use case where a set of strings will be searched for a particular string, s. The percentage of hits, or positive matches, for these searches will be very high; let's say 99%+ of the time, s will be in the set.
I'm using boost::unordered_set right now, and even with its very fast hash algorithm, it takes about 40 ms on good hardware (600 ms on a VM) to search the set 500,000 times. Yeah, that's pretty good, but unacceptable for what I'm working on.
So, is there any sort of data structure optimized for a high percentage of hits? I cannot precompute the hashes for the incoming strings, so I think I'm looking at a complexity of O(average string length) for a hash set like boost::unordered_set. I also looked at tries; these would probably perform well in the opposite case, where hits are rare, but not really any better than hash sets.
Edit: some other details about my particular use case:
The number of strings in the set is around 5,000. The longest string is probably no more than 200 characters. Search gets called again and again with the same strings, but they come in from an outside system and I cannot predict what the next string will be. The exact match rate is actually 99.975%.
Edit 2: I did some of my own benchmarking.
I collected 5,000 of the strings that occur in the real system and created two scenarios:
1) I loop over the list of known strings and look each one up in the container, for a total of 500,000 searches ("hits").
2) I loop through a set of strings known not to be in the container, again for 500,000 searches ("misses").
(Note: I'm interested in hashing the data in reverse because, eyeballing my data, I noticed there are a lot of common prefixes while the suffixes differ - at least that is what it looks like.)
Tests were done on a VirtualBox CentOS 5.6 VM running on a MacBook host.
container                                                          hits (ms)   misses (ms)
boost::unordered_set, default hash, no reserved size                 591.15        441.39
tr1::unordered_set, default hash                                     191.09        143.80
boost::unordered_set, reserved size set                              579.31        431.54
boost::unordered_set, custom hash (last 15 chars + string size)      357.34        812.13
boost::unordered_set, custom hash (last 25 chars + string size)      362.60        795.33
trie                                                                1809.34         58.11
trie, reversed insertion/search                                     2806.26        311.14
In my tests, where there are a lot of matches, the tr1 set is the best. Where there are a lot of misses, the trie wins big.
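For illustration, a functor that hashes only the last N characters plus the string size (the idea behind the "custom hash" rows above) might look like the following. This is a sketch, not necessarily the exact code used for those rows; the multiplier and mixing are illustrative.

#include <boost/unordered_set.hpp>
#include <cstddef>
#include <string>

// Hashes only the last N characters plus the string length, on the theory
// that these strings share long common prefixes and differ in their suffixes.
template <std::size_t N>
struct SuffixHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t start = (s.size() > N) ? s.size() - N : 0;
        std::size_t h = s.size();                       // mix in the length
        for (std::size_t i = start; i < s.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(s[i]);
        return h;
    }
};

// Drop-in replacement for the default hash; equality is unchanged.
typedef boost::unordered_set<std::string, SuffixHash<15> > SuffixSet;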
My test loop looks like this, where function_set is the container being tested (loaded with the 5,000 strings) and functions is a vector of either all the strings in the container or a bunch of strings that are not in the container:
// Repeatedly look up every string in `functions` until we have
// performed kTotalSearches lookups against `function_set`.
while (searched < kTotalSearches) {
    for (std::vector<std::string>::const_iterator i = functions.begin(); i != functions.end(); ++i) {
        function_set.count(*i);   // lookup only; the result is discarded
        searched++;
        if (searched == kTotalSearches)
            break;
    }
}
std::cout << searched << " searches." << std::endl;
This question piqued my curiosity, so I did a few tests to satisfy myself, with the following results and a few general notes.
Basic description of the tests run, with the relevant details:

- std::unordered_map<std::string, int>
- std::unordered_set<std::string>
- boost::unordered_set<std::string>, v1.47.0
- std::unordered_map<const char*, int>
- std::unordered_set<const char*>
- std::unordered_map<> using a custom FNV-1a hash algorithm
- std::unordered_set<> using a custom FNV-1a hash algorithm
- std::binary_search() on a sorted std::vector<std::string>
- A size_t VectorIndex[26][26][26][26][26] array used to index into a sorted array
- std::unordered_map<> using a perfect hash from gperf
- Calling the gperf-generated in_word_set() function directly

Results sorted from slowest to fastest (for the case of mostly hits, ~99.975%):
Results sorted from slowest to fastest (for the case of mostly misses, ~0.1% hits):
My first guess was that a trie would be a good fit for this sort of thing, but from the results the opposite actually appears to be true. Thinking about it some more, this makes sense, and it is for the same reasons you would not use a linked list.
I assume you may be familiar with the table of latencies that every programmer should know. In your case you have 500k lookups executing in 40 ms, or 80 ns per lookup. At that scale you easily lose if you have to touch anything not already in the L1/L2 cache. A trie is really bad for this, as you have an indirect and probably non-local memory access for every character. Given the size of the trie in this case, I couldn't figure out any way of getting the entire trie to fit in cache to improve performance (though it may be possible). I still think that even if you did get the trie to fit entirely in the L2 cache, you would lose with all the indirection required.
The std::unordered_ containers actually do a very good job of things out of the box. In fact, in trying to speed them up I actually made them slower (in the poorly named FastMap and FastSet trials). The same thing happened when trying to switch from std::string to const char* (about twice as slow).
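For reference, FNV-1a is a standard byte-at-a-time hash; a minimal 64-bit version for std::string looks roughly like this. It is an illustrative sketch, not necessarily the exact code used in the FastMap/FastSet trials.

#include <cstdint>
#include <string>
#include <unordered_set>

// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
struct Fnv1aHash {
    std::size_t operator()(const std::string& s) const {
        std::uint64_t h = 14695981039346656037ULL;   // FNV offset basis
        for (std::size_t i = 0; i < s.size(); ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 1099511628211ULL;                   // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

typedef std::unordered_set<std::string, Fnv1aHash> FastSet;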
The boost::unordered_set<> was twice as slow as the std::unordered_set<>, and I don't know if that is because I just used the built-in hash function, was using a slightly old version of Boost, or something else. Have you tried std::unordered_set<> yourself?
By using gperf you can easily create a perfect hash function if your set of strings is known at compile time. You could probably create a perfect hash at runtime as well, depending on how often new strings are added to the map. This gets you a 23% speed increase over the standard map implementation.
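As a rough sketch of that workflow (the class name, file names, and flags here are illustrative; check the gperf documentation for your version):

// keywords.gperf -- the set of strings, one per line between the %% markers:
//   %%
//   some_known_string
//   another_known_string
//   %%
//
// Generate the lookup code with something like:
//   gperf -L C++ -Z FunctionSet keywords.gperf > function_set.hpp
#include <string>
#include "function_set.hpp"   // hypothetical name for the generated header

bool is_known(const std::string& s) {
    // The generated in_word_set() returns a non-null pointer on a hit.
    return FunctionSet::in_word_set(s.c_str(), s.size()) != 0;
}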
The PerfectWordSetThread tests simply use the perfect hash and split the work across 1-32 threads. This problem is perfectly parallel (at least the benchmark is), so you get almost a 5x performance boost in the 16-thread case. That works out to only 6.3 ms per 500k lookups, or 13 ns per lookup... a mere 50 cycles on a 4 GHz processor.
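A sketch of that kind of split using C++11 threads; the lookup here is a stand-in for the generated perfect-hash function, and the actual benchmark's threading code may differ:

#include <atomic>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the real read-only lookup (e.g. the gperf-generated in_word_set()).
static bool in_set(const std::string& s) { return !s.empty(); }

// Count matches for `queries`, striding the work across `num_threads` threads.
// The lookups are read-only, so no locking is needed around the set itself.
std::size_t count_matches(const std::vector<std::string>& queries, unsigned num_threads) {
    std::atomic<std::size_t> matches(0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.push_back(std::thread([&queries, &matches, num_threads, t]() {
            std::size_t local = 0;                 // local count avoids contention
            for (std::size_t i = t; i < queries.size(); i += num_threads)
                if (in_set(queries[i])) ++local;
            matches += local;
        }));
    }
    for (std::size_t i = 0; i < workers.size(); ++i)
        workers[i].join();
    return matches;
}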
The StringIteration case really points out how difficult it is going to be to get much faster. Just iterating the string being found takes 350 ms, or 70% of the time compared to the 500 ms map case. Even if you could perfectly guess each string you would still need this 350 ms (for 10 million lookups) to actually compare and verify the match.
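Conceptually, that kind of baseline is just a loop that reads every character of the query and nothing more, along these lines (an illustrative sketch rather than the exact benchmark code):

#include <string>

// Baseline: read every character of the query and do nothing else.
// Accumulating into a checksum keeps the compiler from eliding the loop.
unsigned string_iteration_baseline(const std::string& s) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        sum += static_cast<unsigned char>(s[i]);
    return sum;
}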
Edit: Another thing that illustrates how tight things are is the difference between PerfectWordSetFunc at 4050 ms and PerfectWordSet at 500 ms. The only difference between the two is that one calls the lookup through a function and the other calls it inline. Calling it as a function reduces the speed by a factor of 8. In basic pseudo-code this is just:
bool IsInPerfectWordSet(string Match)
{
    return in_word_set(Match);
}

// Inline benchmark: PerfectWordSet
for i = 1 to 10,000,000
{
    if (in_word_set(SomeString)) ++MatchCount;
}

// Function call benchmark: PerfectWordSetFunc
for i = 1 to 10,000,000
{
    if (IsInPerfectWordSet(SomeString)) ++MatchCount;
}
This really highlights the difference in performance that inline code and functions can make. You also have to be careful about exactly what you are measuring in a benchmark; sometimes you want to include the function-call overhead, and sometimes not.
Could you go even faster? I've learned never to say "no" to that question, but at some point the effort may not be worth it. If you can split the lookups across threads and use a perfect, or near-perfect, hash function you should be able to approach 100 million lookup matches per second (probably more on a machine with multiple physical processors).
A couple of ideas I don't have the knowledge to attempt:
Take a moment to consider #3... the fastest code is the code that never needs to run. If you can reduce the number of lookups, or reduce the need for extremely high throughput, you won't need to spend time micro-optimizing the ultimate lookup function.