I'm new to hash tables and functions, so I apologize in advance if I got anything wrong.
I'm trying to create a hash table in C++ for a list of about 100k entries comprised of a 7 digit number. The thing is, I got stuck while trying to figure out what hash function to use.
When using %100000 I got ~65k unique keys, while there are ~90k unique entries. Which means that about 1/3 of the data will have collisions.
Is this a good hash function to use? Or is there a better function to use in that case in order to have less collisions? What size should my table be?
Thanks again!
Edit- The entries are numbers between 1 and 2 mil. Is it possible to use the number itself as the key? Or should keys for hash table always start at 0?
The standard library comes with two types internally using a hash table: std::unordered_map
and std::unordered_set
.
As your key type is an integral type you get a hash table pretty conveniently by std::unordered_map<YourIdType, YourDataType
. You can easily access the data via theMap[someId]
, but be aware that if the key is not found a new object is created! If that's not desired you'd rather use theMap.find(someId)
, which returns an iterator.
The drawback, though, is that you store the id twice then (internally as a std::pair<YourIdType, YourDataType>
). You can avoid that by using a std::unordered_set
. To be able to do so, though, you need to specialise std::hash
and std::equal_to
for your type:
namespace std // you are not allowed to add to – with exception of specialisations
{
template<>
struct hash<YourDataType>
{
size_t operator()(YourDataType const& object) const
{
return hash<YourIdType>()(object.id);
}
};
// analogously equal_to with appropriate comparisons, possibly by
// comparing the object's addresses
Alternatively you can provide a custom hasher type (with C++20 that can even be a lambda packed into decltype
) to the set as second template parameter and just implement operator==
for your object type, or provide a custom equality comparer type if you need it to compare differently than the operator, maybe like:
// C++20 required:
using YourMapType = std::set
<
YourDataType,
decltype
(
[](YourDataType const& object)
{ return std::hash<YourIdType>()(object.id); }
),
decltype
(
[](YourDataType const& o1, YourDataType const& o2)
{ return &o1 == &o2; } // TODO: comparisons as you need!
)
>;
// alternatively create custom types with appropriate operator() implementations
Drawback here is – apart from additional complexity for the specialisations – that you cannot lookup objects by the id only, instead you need a complete object of your data type.
So which one is more appropriate/suitable? That depends on your concrete requirements...