Search code examples
stringc++11dictionarysvmsparse-matrix

map a million strings to ints in c++11


I have a million ASCII strings, without duplicates, each at most 7 bytes long. I need to map each string to a positive integer. The largest of these ints should be not much more than a million. Although initialization may be slow, lookup should be fast: given a string, return the corresponding int (or -1, if not found). How can one implement this in C++11?

One solution: accumulate the strings into a std::unordered_map<string,int>; then iterate over the map, assigning the ints from an incrementing counter. Then to lookup, just unordered_map::find("foo")->second. But it smells like some other container would be faster and have less overhead (indices built in, rather than hand-coded). Maybe unordered_set and pointer arithmetic??

The range restriction seems to make a perfect hash difficult.

(The int's range is restricted, because it indexes into a feature vector passed to svm_light. That software doesn't use sparse storage, so vectors with trillions of (mostly zero) elements make it run out of memory. So this string-to-int preprocessing sort of implements a sparse data structure.)


Solution

  • What you describe looks like perfect hashing.

    There are C++ libraries which implement perfect hash, for example Tiny perfect hash library for C, C++, and Lua.