Tags: c, data-structures, hashmap, trie

What data structure to use? (hash map vs. trie vs. ?)


I have a C function that produces about 6 million unique arrays. These arrays always have 17 elements each, and each element is an integer from 0 to 16. I also have a slightly modified version of that function that also produces about 6 million unique arrays of the same kind. My problem is that the second one produces about 45,000 fewer results than the first, and I'd like to see what those missing results are.

So my approach is to simply store all the results of the second function (a calculator tells me this should take no more than 400 MB, which is fine to keep in memory) and then look up the results of the first, printing out the ones that aren't found.

Assuming the general approach makes sense (and if not, do tell), what I am looking for is an appropriate data structure (ideally with a good implementation in C) that can hold about 6 million unique permutations of

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]

(or some transformation thereof) and then perform fast membership testing on them. As the title says, I do have some suspicions about which data structures may do the job, but I am not certain that tries or hash maps are the best choice for this.

This is an algorithm to detect a flaw in another algorithm, not something that will be used in production. I am interested in something that can be coded up and return results relatively quickly in human terms, not necessarily shave milliseconds, so the existence of easy-to-grok libraries that will do most of the job is definitely a plus.


Solution

  • Optimality would kind of depend on how the permutations are distributed and the ratio of insertions to searches. Since you are not concerned with optimality, but just want a straightforward way to test a hypothesis without waiting all night for results, my gut says:

    An integer in [0,16] can be represented as a five-bit number, so seventeen of them can be packed into an 85-bit (11-byte) binary string (a packing sketch follows below this list). So you can just use one of the many libraries available for storing sorted/hashed sets of strings with membership tests on them, and be done. It won't be quite as fast or cache-friendly as a tuned trie, but it'll be good enough to grind through 66 MB of data in a few seconds, and you'll be done by lunch.

    If no such library is conveniently to hand and you have to work from scratch, I'd just make a sorted list of the strings and then do the membership tests via binary search (see the second sketch below). That works out to O(n log n) for the sort plus O(m log n) for the m lookups, i.e. O((n + m) log n) overall, which stays linearithmic even as m→n. If this is only being run as an offline job once or twice, that might be good enough; if you're going to do this more than once a day, I'd worry about cache locality and use a trie or B-tree.
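
For concreteness, here is a minimal sketch of the packing step mentioned above, assuming the results arrive as plain int arrays. pack_key is a hypothetical helper name, and the low-bits-first layout is an arbitrary choice; only consistency between writer and reader matters.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pack 17 values in [0,16] (5 bits each) into an
   11-byte key. 85 bits are used; the top 3 bits of the last byte stay 0. */
static void pack_key(const int vals[17], uint8_t key[11])
{
    memset(key, 0, 11);
    int bit = 0;                               /* next free bit position */
    for (int i = 0; i < 17; i++) {
        uint32_t v = (uint32_t)vals[i] & 0x1F; /* 5-bit value */
        key[bit >> 3] |= (uint8_t)(v << (bit & 7));
        if ((bit & 7) > 3)                     /* straddles a byte edge */
            key[(bit >> 3) + 1] |= (uint8_t)(v >> (8 - (bit & 7)));
        bit += 5;
    }
}

int main(void)
{
    int a[17] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    uint8_t key[11];
    pack_key(a, key);
    for (int i = 0; i < 11; i++)
        printf("%02x", key[i]);                /* prints the packed key */
    putchar('\n');
    return 0;
}
```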
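
And here is a self-contained sketch of the from-scratch fallback, using the C standard library's qsort and bsearch over fixed-width 11-byte keys stored contiguously. The three hard-coded toy keys stand in for the ~6 million packed keys you would build with a helper like the one above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_BYTES 11

/* Compare two fixed-width keys stored back to back in one big array. */
static int key_cmp(const void *a, const void *b)
{
    return memcmp(a, b, KEY_BYTES);
}

int main(void)
{
    /* Toy stand-in for the second function's output: three packed keys.
       In the real program this would be ~6 million keys in one
       contiguous malloc'd block, filled by the generator. */
    uint8_t keys[3][KEY_BYTES] = { {0x03}, {0x01}, {0x02} };
    size_t n = 3;

    /* Sort once: O(n log n). */
    qsort(keys, n, KEY_BYTES, key_cmp);

    /* One membership probe: O(log n). In the real program, loop over the
       first function's results and print every array whose key is absent. */
    uint8_t probe[KEY_BYTES] = {0x04};
    if (bsearch(probe, keys, n, KEY_BYTES, key_cmp) == NULL)
        puts("missing: produced by the first function only");

    return 0;
}
```

Storing the keys contiguously and comparing with memcmp keeps the sorted-array approach simple and reasonably cache-friendly, which is part of why it is good enough for a one-off diagnostic run like this.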