I am working on a project that interprets election data using custom data structures. Currently I am deciding which data structure is best for storing the final number of votes each candidate received in different territorial units.
Since this is homework, data structures built into the language and data structures from external libraries are forbidden. Also, the search complexity must be better than O(n).
The key would be of type unsigned int; the key itself would be the candidate's number on the ballot. The hash function I intend to use looks like this:
template<typename K, typename T>
inline int CandidateResultsHashTable<K, T>::hashFunction(const K & key) const
{
    return key % (amount_of_candidates + 1);
}
The number of candidates is known, although it can change between election rounds. All the data stored in the hash table would be read from a file that contains data for all candidates, so there shouldn't be any number that doesn't belong to a candidate.
I want to know which implementation would be better in terms of access time and memory usage.
I've aggregated my comments into one answer.
This is a summary of different ways to implement the data structure called a map (a dictionary in some other languages).
The simplest way of solving your problem would be to use an array/list of key-value pairs which you check one by one until you find the right key. It has very poor efficiency, though: O(n) is acceptable only for small data sets, where speed doesn't matter that much. For very small amounts of data this approach may even be faster, because of the overhead that more sophisticated data structures have (e.g. calculating the hash function).
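For illustration, a minimal sketch of that approach; the names (Entry, findLinear) are hypothetical and not from your code:

#include <cstddef>

// A fixed-size list of (candidate number, vote count) pairs, scanned one by one.
struct Entry {
    unsigned int key;    // candidate's number on the ballot
    unsigned int votes;  // final vote count
};

// Returns a pointer to the matching entry, or nullptr if the key is absent.
// O(n): in the worst case every entry has to be inspected.
const Entry* findLinear(const Entry* entries, std::size_t count, unsigned int key)
{
    for (std::size_t i = 0; i < count; ++i)
        if (entries[i].key == key)
            return &entries[i];
    return nullptr;
}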
This approach can be optimized quite significantly if you sort your keys and use binary search, which is only O(log(n)).
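A rough sketch of that variant, assuming the same hypothetical Entry struct as above and an array kept sorted by key:

// O(log n) lookup in an array of Entry sorted by key.
const Entry* findBinary(const Entry* entries, std::size_t count, unsigned int key)
{
    std::size_t lo = 0, hi = count;            // search the half-open range [lo, hi)
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].key == key)
            return &entries[mid];
        if (entries[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return nullptr;                            // key not present
}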
A hash table is rather tricky to implement. You need a good enough hash function. A good hash function is one that produces a low number of collisions - situations where two different keys have the same hash. You need to handle this situation anyway, but too many collisions reduce the benefits of using a hash table.
Your implementation is quite simple.
key % (amount_of_candidates + 1)
It is hard to tell if it is good enough without knowing how keys are assigned.
If the keys are just consecutive numbers, it is perfectly good. (You don't even need the + 1.) Actually, in that situation you have a special case of a hash table where you don't need to check for collisions, because you can tell there won't be any.
At this point you can stop pretending that you use a hash table and just make an array ;) The position of each candidate is just key - smallest_key. In fact, this would be a very effective solution: O(1).
You cannot simplify it that much if keys are assigned randomly. In this case your solution is mostly good. However, (amount_of_candidates + 1) is too small a size for the hash table. It should be about 30% bigger than the amount of data (i.e. a load factor of roughly 0.75). This will decrease the number of collisions to a reasonable level.
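A rough sketch of what that sizing could look like, together with a simple collision strategy (linear probing); the names, the exact growth factor, and the probing choice here are illustrative assumptions, and copy control is again omitted:

#include <cstddef>

// Open-addressing hash table for (candidate number, vote count) pairs.
class VoteHashTable {
public:
    explicit VoteHashTable(std::size_t amount_of_candidates)
        : size(amount_of_candidates + amount_of_candidates * 3 / 10 + 1), // ~30% spare slots
          keys(new unsigned int[size]()),
          votes(new unsigned int[size]()),
          used(new bool[size]())
    {
    }
    ~VoteHashTable() { delete[] keys; delete[] votes; delete[] used; }

    void insert(unsigned int key, unsigned int vote_count)
    {
        std::size_t i = hash(key);
        while (used[i] && keys[i] != key)   // linear probing on collision
            i = (i + 1) % size;
        keys[i] = key; votes[i] = vote_count; used[i] = true;
    }

    // Returns true and fills vote_count if the key is present.
    bool find(unsigned int key, unsigned int& vote_count) const
    {
        std::size_t i = hash(key);
        while (used[i]) {
            if (keys[i] == key) { vote_count = votes[i]; return true; }
            i = (i + 1) % size;             // the table is never full, so this terminates
        }
        return false;
    }

private:
    std::size_t hash(unsigned int key) const { return key % size; }

    std::size_t size;      // ~1.3x the number of candidates
    unsigned int* keys;
    unsigned int* votes;
    bool* used;
};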
Yet another solution would be to use a binary tree that directly maps to the binary representation of the key (0 - left branch, 1 - right branch). This method is very similar to binary search in an array, but it makes it easy to add new elements without resizing the array and sorting the new element into it. The disadvantage of this solution is higher memory requirements.
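A very rough sketch of such a bitwise trie, assuming 32-bit unsigned keys; the names are illustrative and node cleanup is omitted:

// One tree level per key bit: lookup cost depends on the key width,
// not on the number of stored elements.
struct TrieNode {
    TrieNode* child[2] = { nullptr, nullptr };  // 0 = left branch, 1 = right branch
    unsigned int votes = 0;                     // payload stored at the leaf
    bool is_leaf = false;
};

void trieInsert(TrieNode* root, unsigned int key, unsigned int votes)
{
    TrieNode* node = root;
    for (int bit = 31; bit >= 0; --bit) {
        int dir = (key >> bit) & 1u;
        if (!node->child[dir])
            node->child[dir] = new TrieNode();
        node = node->child[dir];
    }
    node->votes = votes;
    node->is_leaf = true;
}

const TrieNode* trieFind(const TrieNode* root, unsigned int key)
{
    const TrieNode* node = root;
    for (int bit = 31; bit >= 0 && node; --bit)
        node = node->child[(key >> bit) & 1u];
    return (node && node->is_leaf) ? node : nullptr;
}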
You could also experiment with other types of binary trees. You just need to remember to keep them balanced so they stay efficient. I don't know much about balancing, so I won't write more on this topic.
I infer that, in your case, keys are just consecutive integers, so I would recommend the solution that uses a plain array with indices tied directly to the values of the keys. This is a very simple and at the same time very effective solution.
OK, let's actually answer the question from the title.
The implementation of a perfect hash function you showed is no different from an array. It is just another way of coding the same thing, and depending on some factors the resulting assembly may be the same.
In the case of another hash function, where keys are distributed over the whole range of K, a straight array would be impractical or impossible to use due to the huge amount of memory it would need. If you did manage to allocate that much memory, the array would be slightly faster because it wouldn't need to calculate hashes, but it certainly wouldn't be worth it.