When an approximate comparison between strings is required, the basic Levenshtein distance can help. It measures the number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another:
"aaaa" vs "aaab" => 1
"abba" vs "aabb" => 2
"aaaa" vs "a" => 3
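For reference, here is a minimal sketch of how such a distance could be computed, using the classic dynamic-programming approach with two rolling rows. The class name `Levenshtein` and the method name `Distance` are placeholders chosen for this illustration, not part of any library.

```csharp
using System;

public static class Levenshtein
{
    // Edit distance via dynamic programming, keeping only two rows of the table.
    public static int Distance(string s, string t)
    {
        if (s is null) throw new ArgumentNullException(nameof(s));
        if (t is null) throw new ArgumentNullException(nameof(t));

        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];

        // Turning the empty string into the first j characters of t takes j insertions.
        for (int j = 0; j <= t.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            // Turning the first i characters of s into the empty string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int substitutionCost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1,            // deletion
                             current[j - 1] + 1),        // insertion
                    previous[j - 1] + substitutionCost); // substitution or match
            }
            // The row just computed becomes the "previous" row of the next iteration.
            (previous, current) = (current, previous);
        }
        return previous[t.Length];
    }
}
```

With this helper, `Levenshtein.Distance("aaaa", "aaab")` gives 1, `Levenshtein.Distance("abba", "aabb")` gives 2, and `Levenshtein.Distance("aaaa", "a")` gives 3, matching the examples above.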
When using a Dictionary<T, U>, one can provide a custom IEqualityComparer<T>. One can implement the Levenshtein distance as an IEqualityComparer<string>:
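using System;
using System.Collections.Generic;

public class LevenshteinStringComparer : IEqualityComparer<string>
{
    private readonly int _maximumDistance;

    public LevenshteinStringComparer(int maximumDistance)
        => _maximumDistance = maximumDistance;

    // Two strings count as "equal" when they are at most _maximumDistance edits apart.
    public bool Equals(string x, string y)
        => ComputeLevenshteinDistance(x, y) <= _maximumDistance;

    // Constant hash: every key lands in the same bucket.
    public int GetHashCode(string obj)
        => 0;

    private static int ComputeLevenshteinDistance(string s, string t)
    {
        // Omitted for simplicity
        // Example can be found here: https://www.dotnetperls.com/levenshtein
        throw new NotImplementedException(); // placeholder so the class compiles
    }
}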
So we can use a fuzzy dictionary:
var dict = new Dictionary<string, int>(new LevenshteinStringComparer(2));
dict["aaa"] = 1;
dict["aab"] = 2; // Modify existing value under "aaa" key
// Only one key was created:
// dict.Keys => { "aaa" }
With all this set up, you may have noticed that we haven't implemented a proper GetHashCode in the LevenshteinStringComparer, which the dictionary would greatly appreciate. As rules of thumb regarding hash codes, I'd use the usual ones, including that objects which compare as equal must return the same hash code.
The only possible hash function following these rules that I can imagine is a constant number, just as implemented in the given code. This isn't optimal, but as soon as we take, for example, the default hash of the string, aaa and aab would end up with different hashes even though they are treated as equal. Thinking further, this means all possible strings have to have the same hash.
Am I correct? And why does the performance of the dictionary get better when I use the default string hash function, with its hash collisions, for our comparer? Shouldn't this make the hash buckets inside the dictionary invalid?
public int GetHashCode(string obj)
    => obj.GetHashCode();
I don't think there is a hashing function that could work in your case.
The problem is that you have to assign the bucket based on a single value only, without knowing what was added before. But the Levenshtein distance of the item being hashed can be anything from 0 to "infinity"; the only thing that matters is what it is compared with. Hence a hash computed from the string's content cannot satisfy the second condition of a hashing function (equal objects must have the same hash code).
Another argument (a "pseudo-proof") is the situation where you want a maximum distance of 2 and you already have two items in the dictionary whose mutual distance is 3. If you then add a string that is at distance 2 from the first item and distance 1 from the second item, how would you decide which item it should match? It satisfies your maximum for both items, but it should probably match the second one rather than the first. Without knowing anything about the contents of the dictionary, you cannot know how to hash it correctly.
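To make that scenario concrete, here is a small sketch with strings chosen for illustration: "aaaa" and "abbb" are 3 edits apart, while "aabb" is 2 edits from the first and 1 edit from the second. The class name `FuzzyEquality` and the method `AreClose` are hypothetical helpers, and the distance routine is just one standard way to compute the Levenshtein distance.

```csharp
using System;

public static class FuzzyEquality
{
    // Mirrors the comparer's Equals: "equal" means at most maximumDistance edits apart.
    public static bool AreClose(string x, string y, int maximumDistance)
        => Distance(x, y) <= maximumDistance;

    // Standard dynamic-programming Levenshtein distance with two rolling rows.
    public static int Distance(string s, string t)
    {
        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            (previous, current) = (current, previous);
        }
        return previous[t.Length];
    }
}
```

With a maximum distance of 2, `AreClose("aaaa", "aabb", 2)` and `AreClose("abbb", "aabb", 2)` are both true, while `AreClose("aaaa", "abbb", 2)` is false. The relation is not transitive, so a consistent hash would have to give "aabb" the same code as both of its neighbours, and, following such chains of close strings, the same code as every other string.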
For the second question: using the default string.GetHashCode() method does improve performance, but it destroys the functionality of your equality comparer. If you test this solution on your sample code, you can see that dict now contains two keys. This is because GetHashCode returned two different hash codes, so there was no collision: dict now holds two separate entries, and your Equals method was not even executed.
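You can check this yourself with a rough sketch like the one below, which counts how often Equals is invoked. `CountingLevenshteinComparer`, `HashDemo` and the `useStringHash` switch are names made up for this experiment, and the distance routine is a standard implementation standing in for the one omitted from your code.

```csharp
using System;
using System.Collections.Generic;

public sealed class CountingLevenshteinComparer : IEqualityComparer<string>
{
    private readonly int _maximumDistance;
    private readonly bool _useStringHash;

    // Number of times the dictionary actually asked us to compare two keys.
    public int EqualsCalls { get; private set; }

    public CountingLevenshteinComparer(int maximumDistance, bool useStringHash)
    {
        _maximumDistance = maximumDistance;
        _useStringHash = useStringHash;
    }

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return Distance(x, y) <= _maximumDistance;
    }

    // Either the default string hash or the constant hash from the question.
    public int GetHashCode(string obj)
        => _useStringHash ? obj.GetHashCode() : 0;

    // Standard dynamic-programming Levenshtein distance with two rolling rows.
    private static int Distance(string s, string t)
    {
        var previous = new int[t.Length + 1];
        var current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            (previous, current) = (current, previous);
        }
        return previous[t.Length];
    }
}

public static class HashDemo
{
    // Repeats the sample from the question and reports key count and Equals invocations.
    public static (int KeyCount, int EqualsCalls) Run(bool useStringHash)
    {
        var comparer = new CountingLevenshteinComparer(2, useStringHash);
        var dict = new Dictionary<string, int>(comparer);
        dict["aaa"] = 1;
        dict["aab"] = 2;
        return (dict.Count, comparer.EqualsCalls);
    }
}
```

With the constant hash, `HashDemo.Run(false)` reports one key and one Equals call. With the string hash, `HashDemo.Run(true)` should report two keys and, barring an extremely unlikely collision between the hash codes of "aaa" and "aab", no Equals calls at all.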