c# arrays hash checksum

C#: Is it possible to generate an identifier for an array of double values?


I am working with existing data and have records which contain an array double[23] and double[46]. The values in the array can be the same across multiple records. I would like to generate an id (perhaps an int) to uniquely identify the values in each array.

There are places in the application where I need to group records based on the values in the array being identical. While there are ways to query for this, I was hoping for a single int field (or something similar) to group on. This would really help simplify queries and especially help with report tools where grouping on a smaller single field would help immensely.

I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values. I had tried implementing

((IStructuralEquatable)combined).GetHashCode(EqualityComparer<double>.Default);

to compare the structure and data, but again, I don't think this is guaranteed to match for another double[] having the same values.

Perhaps a form of checksum would work but admittedly I am having trouble implementing something. I am looking for suggestions/direction.

Here is data for 3 sample records. Data in record 1&3 are the same so a generated id should match for those.

32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9
40.8,49.0,50.0,49.0,47.8,47.0,45.8,44.5,43.5,42.0,40.5,38.5,36.8,34.5,32.5,30.5,28.0,25.5,22.5,19.5,16.5,13.5,10.5,48.0
32.7,48.9,55.9,48.9,47.7,46.9,45.7,44.4,43.4,41.9,40.4,38.4,36.7,34.4,32.4,30.4,27.9,25.4,22.4,19.4,16.4,13.4,10.4,47.9

Perhaps this is not possible without just checking all the data, but was hoping for a better solution to simplify the application and improve the speed.

The goal is to add a new id field to the existing records to represent the array data. That way, passing records into report tools would group together easily on one field rather than checking the whole array on each record.

I appreciate any direction.

EDIT - Some issues I ran into while trying things (in case it helps someone)

In trying to understand this originally, I was calling this code (which is part of .NET). I understood these functions would hash the values of the array together (only the last 8 values in this case). I didn't think it included the array handle. The result was not quite as expected because of a bug that MS later corrected in .NET, as noted in the comment inside the loop below. With the fix I was getting better results.

int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
    if (comparer == null)
        throw new ArgumentNullException("comparer");
    Contract.EndContractBlock();

    int ret = 0;

    // Only the last eight elements (at most) contribute to the hash.
    for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
        // .NET 4.6.2; in .NET 4.5.2 this line was:
        //   ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(0)));
        ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
    }

    return ret;
}

internal static int CombineHashCodes(int h1, int h2) {
    return (((h1 << 5) + h1) ^ h2);
}

I modified this to handle more than 8 values and still had some hashes not matching. I later determined the issue was in the data; I was unaware some of the records had some doubles stored with more than one decimal place (should have been rounded). This of course changed the hash. Now that I have the data consistent, I am seeing matching hashes; any arrays with identical values have an identical hash.
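For anyone trying the same thing, here is a minimal sketch of that idea, not the actual .NET code: it rounds each value to one decimal place first, then folds every element (not just the last eight) into the result using the same combining step as CombineHashCodes above. The names ArrayId and Compute are made up for illustration, and the one-decimal rounding is an assumption based on the data described here.

```csharp
using System;

public static class ArrayId
{
    // Same combining step as the CombineHashCodes method quoted above.
    private static int CombineHashCodes(int h1, int h2)
    {
        return unchecked(((h1 << 5) + h1) ^ h2);
    }

    // Hypothetical helper: hashes every element after rounding to one decimal place.
    public static int Compute(double[] values)
    {
        if (values == null)
            throw new ArgumentNullException(nameof(values));

        int ret = 0;
        foreach (double v in values)
        {
            // Scale to tenths and round, so 32.7000001 and 32.7 produce the same integer.
            long tenths = (long)Math.Round(v * 10.0, MidpointRounding.AwayFromZero);

            // Fold the 64-bit integer into 32 bits.
            int elementHash = unchecked((int)tenths ^ (int)(tenths >> 32));

            ret = CombineHashCodes(ret, elementHash);
        }
        return ret;
    }
}
```

Because this uses only integer arithmetic on the rounded values, the result should be the same on every run, which matters if the id is stored with the record. Two caveats: values that sit right on a rounding boundary can still land on different sides, and a 32-bit result can collide, so a matching id suggests matching arrays but does not prove it.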


Solution

  • I thought of generating a hash code, but I understand these are not guaranteed to be the same for each double[] with matching values

    Quite the opposite: a hash function is required by design to return equal hashes for equal inputs. Even a function that always returns 0 is a valid (if useless) starting point, because it returns the same value for equal rows. Everything else is just an optimization to reduce collisions, i.e. false positives where different rows share a hash.

  • Perhaps this is not possible without just checking all the data

    Of course you need to check all the data; how else would you do it?

    However, your implementation is broken. The default GetHashCode() for an array hashes the reference to the array itself, not its contents, so different array instances holding the same data will show up as different. What you want to do is create a HashCode instance, Add() each element of your array to it, and call ToHashCode() to get a proper hash code.
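    The HashCode suggestion could look roughly like the sketch below. It assumes System.HashCode is available in your target framework. DoubleArrayComparer and ArrayGrouping are hypothetical names used only for illustration. The comparer pairs the hash with a full element-by-element check, so a collision never merges two different arrays.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two double[] are equal when their elements match in order.
public sealed class DoubleArrayComparer : IEqualityComparer<double[]>
{
    public bool Equals(double[] x, double[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(double[] values)
    {
        if (values == null) return 0;

        // Add every element, so equal contents give equal hashes.
        var hash = new HashCode();
        foreach (double v in values)
            hash.Add(v);
        return hash.ToHashCode();
    }
}

public static class ArrayGrouping
{
    // Groups arrays by content and returns the number of distinct groups.
    public static int CountGroups(IEnumerable<double[]> arrays)
    {
        return arrays.GroupBy(a => a, new DoubleArrayComparer()).Count();
    }
}
```

    One thing to keep in mind: HashCode is seeded randomly per process, so its values are only stable within a single run. That is fine for in-memory grouping like the above, but an id that gets stored in the records needs a deterministic function instead.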