I'm implementing a caching table to avoid having to perform a costly operation that creates a generic object from a set of parameters describing it. Once an object is requested, a hash of these parameters is computed, and a Dictionary containing the already created objects is queried to check whether a copy has already been created, in which case it's returned without creating it again.
My problem lies in the fact that since the parameters describing these objects can be many, collisions in the hashing function are unavoidable (and too frequent), but on the other hand retrieving these objects is a performance-critical operation and I cannot afford full comparisons against all existing descriptions to search among the already created objects. I've tried many different hashing functions, but since the nature of the parameters is unknown the results are unreliable.
What solutions other than hashing are there to this caching problem, or can hashing be used differently to avoid conflicts? C# description of the problem:
class ObjectDescriptor
{
    // description made of a list of parameters of unknown type
    public object[] Fields;

    // hashing procedure that may have conflicts
    public override int GetHashCode()
    {
        int hash = 1009;
        for (int i = 0; i < Fields.Length; i++)
        {
            unchecked { hash = hash * 9176 + Fields[i].GetHashCode(); }
        }
        return hash;
    }
}
abstract class ObjectCache<T>
{
    private Dictionary<int, T> indexedObjects = new Dictionary<int, T>();

    // this operation is called many times and must be fast
    public T Get(ObjectDescriptor descr)
    {
        int hash = descr.GetHashCode();
        T cachedValue;
        if (!indexedObjects.TryGetValue(hash, out cachedValue))
        {
            cachedValue = CreateObject(descr);
            indexedObjects[hash] = cachedValue;
        }
        return cachedValue;
    }

    // costly operation
    protected abstract T CreateObject(ObjectDescriptor desc);
}
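To give an idea of the intended usage, here is a minimal sketch (the StringCache subclass and its string-joining CreateObject are only illustrative stand-ins for the real costly construction):
class StringCache : ObjectCache<string>
{
    // Stand-in for the real costly operation.
    protected override string CreateObject(ObjectDescriptor desc)
    {
        return string.Join("|", desc.Fields);
    }
}

// Repeated calls with an equivalent descriptor should hit the cache.
var cache = new StringCache();
var value = cache.Get(new ObjectDescriptor { Fields = new object[] { 42, "abc", true } });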
I'll leave here the solution I ended up using. It is based on the fact that conflicts can be avoided by packing the whole values of multiple fields into a single hash where possible:
byte b1 = 42, b2 = 255;
int naiveHash = CombineHash(b1.GetHashCode(), b2.GetHashCode()); // will always have conflicts
int typeAwareHash = (b1 << 8) + b2; // no conflicts: each byte keeps its own 8 bits
To know how many bits a field requires, I had each field implement IObjectDescriptorField:
interface IObjectDescriptorField
{
    int GetHashCodeBitCount();
}
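For example, a minimal sketch of a field wrapping a bool (the BoolField name is only illustrative) needs a single bit:
class BoolField : IObjectDescriptorField
{
    public bool Value;

    // A bool can only take two values, so one bit is enough.
    public int GetHashCodeBitCount()
    {
        return 1;
    }

    public override int GetHashCode()
    {
        return Value ? 1 : 0;
    }
}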
I then updated the ObjectDescriptor class to use a HashCodeBuilder class:
class ObjectDescriptor
{
    public IObjectDescriptorField[] Fields;

    public override int GetHashCode()
    {
        HashCodeBuilder hash = new HashCodeBuilder();
        for (int i = 0; i < Fields.Length; i++)
        {
            hash.AddBits(Fields[i].GetHashCode(), Fields[i].GetHashCodeBitCount());
        }
        return hash.GetHashCode();
    }
}
HashCodeBuilder stacks up bits until all 32 are used, and then falls back to a simple hash combination function like before:
public class HashCodeBuilder
{
    private const int HASH_SEED = 352654597;

    private static int Combine(int hash1, int hash2)
    {
        unchecked
        {
            return ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hash2;
        }
    }

    private int hashAccumulator;
    private int bitAccumulator;
    private int bitsLeft;

    public HashCodeBuilder()
    {
        hashAccumulator = HASH_SEED;
        bitAccumulator = 0;
        bitsLeft = 32;
    }

    public void AddBits(int bits, int bitCount)
    {
        // When the 32-bit window is full, fold it into the hash and start a new window.
        if (bitsLeft < bitCount)
        {
            hashAccumulator = Combine(hashAccumulator, bitAccumulator);
            bitAccumulator = 0;
            bitsLeft = 32;
        }
        // Keep only the declared number of bits, then append them to the window.
        int mask = bitCount < 32 ? unchecked((1 << bitCount) - 1) : -1;
        bitAccumulator = (bitAccumulator << bitCount) | (bits & mask);
        bitsLeft -= bitCount;
    }

    public override int GetHashCode()
    {
        return Combine(hashAccumulator, bitAccumulator);
    }
}
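A quick usage sketch with illustrative values:
var builder = new HashCodeBuilder();
builder.AddBits(1, 1);   // a bool flag, 1 bit
builder.AddBits(3, 4);   // a small enum with up to 16 values, 4 bits
builder.AddBits(42, 8);  // a byte-sized parameter, 8 bits
int hash = builder.GetHashCode(); // only 13 of the 32 bits used, so no collisions are possible yet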
This solution of course still has conflicts if more than 32 bits are needed, but it worked for me because many of the fields were just bools or enums with few values, which benefit greatly from being combined like this.
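For instance, a minimal sketch of an enum-backed field (DayOfWeekField is only illustrative) can derive its bit count from the number of enum values, so a 7-value enum takes just 3 bits:
class DayOfWeekField : IObjectDescriptorField
{
    public DayOfWeek Value;

    public int GetHashCodeBitCount()
    {
        // Smallest number of bits that can index every enum value (7 values -> 3 bits).
        int valueCount = Enum.GetValues(typeof(DayOfWeek)).Length;
        int bits = 1;
        while ((1 << bits) < valueCount)
        {
            bits++;
        }
        return bits;
    }

    public override int GetHashCode()
    {
        return (int)Value;
    }
}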