c# object uuid guid

Object to GUID/UUID


I want to take any object and get a guid that represents that object.

I know that entails a lot of things. I am looking for a good-enough solution for common applications.

My specific use case is caching: I want to know whether the object used to create the thing I am caching has already produced one in the past. There would be 2 different types of objects. Each type contains only public properties, and may contain a list/IEnumerable.

Assuming the object is serializable, my first idea was to serialize it to JSON (via the built-in JSON serializer or Newtonsoft) and then convert the JSON string to a UUID version 5, as detailed in a gist linked from "How can I generate a GUID for a string?"

My second approach, if the object is not serializable (for example, because it contains a dictionary), would be to use reflection on the public properties to generate a unique string of some sort and then convert that to a UUID version 5.

Both approaches use UUID version 5 to turn a string into a GUID. Is there a proven C# class that produces valid UUID version 5 GUIDs? The gist looks good, but I want to be sure.

I was thinking of using the C# namespace and type name as the namespace for the UUID version 5. Is that a valid use of the namespace?

My first approach is good enough for my simple use case but I wanted to explore the second approach as it's more flexible.

If creating the GUID can't guarantee reasonable uniqueness, it should throw an error. Surely very complicated objects would fail. How could I detect that case when using reflection?

I am looking for new approaches or concerns/implementations to the second approach.


Edit: The reason I put a bounty on this and reopened it almost 3 years later is that I need this again (and for caching again), but also because of the introduction of the generic unmanaged constraint in C# 7.3. The blog post at http://devblogs.microsoft.com/premier-developer/dissecting-new-generics-constraints-in-c-7-3/ seems to suggest that if the object satisfies the unmanaged constraint, you can derive a suitable key for a key-value store. Am I misunderstanding something?

This is still limited, because the (generic) object must satisfy the unmanaged type constraint, which is very restrictive (no strings, no arrays, etc.), but it's one step closer. I don't completely understand why the method of getting the memory stream and taking a SHA-1 hash can't be applied to types that are not unmanaged.

I understand that reference types point to places in memory, and that it's not as easy to get the memory that represents the whole object, but it feels doable. After all, objects are ultimately made up of a bunch of unmanaged types (a string is an array of chars, etc.).

PS: The GUID requirement is loose; any integer/string at or under 512 bits would suffice.


Solution

  • The problem of equality is a difficult one.
    Here are some thoughts on how you could solve your problem.

    Hashing a serialized object
    One method would be to serialize an object and then hash the result as proposed by Georg.
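    An MD5 checksum is 128 bits, the same size as a GUID, and gives you well-distributed keys with the right input. (MD5 is not collision resistant against deliberately crafted input, which rarely matters for a cache key.)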
    But getting it right is the problem.
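
    As a rough illustration of the serialize-then-hash idea, here is a minimal sketch. It assumes System.Text.Json is available and that the type serializes deterministically (same property order, same number formatting on every run), which you would need to verify for your own types. The names `ObjectKey` and `FromJson` are made up for this example, and the result is a plain MD5-derived GUID, not an RFC 4122 version 5 UUID.

    ```csharp
    using System;
    using System.Security.Cryptography;
    using System.Text;
    using System.Text.Json;

    public static class ObjectKey
    {
        // Serializes the value to JSON and folds the UTF-8 bytes into a 128-bit key.
        // Assumption: JsonSerializer emits identical text for objects you consider equal.
        public static Guid FromJson<T>(T value)
        {
            // Prefix the type name so equal-looking objects of different types get different keys.
            string payload = typeof(T).FullName + "\n" + JsonSerializer.Serialize(value);
            using (MD5 md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(payload));
                return new Guid(hash); // MD5 yields exactly the 16 bytes a Guid needs
            }
        }
    }
    ```

    Two objects that serialize to the same text get the same key; anything that changes the text, even whitespace or a trailing digit, gives an unrelated key.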

    You might have trouble using a common serialization framework, because:

    • They don't care whether a float is 1.0 or 1.000000000000001.
    • They might have a different understanding about what is equal than you / your employer.
    • They bloat the serialized text with unneeded symbols. (performance)
    • Just a little deviation in the serialized text causes a large deviation in the hashed GUID/UUID.

    That's why you should carefully test any serialization you do.
    Otherwise you might get false positives/negatives for objects (mostly false negatives).

    Some points to think about:

    • Floats & Doubles:
      Always write them the same way, preferably with the same number of digits to prevent something like 1.000000000000001 vs 1.0 from interfering.
    • DateTime, TimeStamp, etc.:
      Apply a fixed format that won't change and is unambiguous.
    • Unordered collections:
      Sort the data before serializing it. The order must be unambiguous.
    • Strings:
      Is the equality case-sensitive? If not, convert all strings to lower or upper case.
      If necessary, make the conversion culture invariant.
    • More:
      For every type, think carefully about what counts as equal and what does not. Think especially about edge cases (float.NaN, -0 vs 0, null, etc.).
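
    The points above could be sketched as a handful of canonical formatting helpers. This is only one possible reading: the class name `Canonical`, the default of 9 decimal digits, the null marker, and the choice to treat `DateTimeKind.Unspecified` values as local time are all assumptions made for the example, not recommendations.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;

    public static class Canonical
    {
        // Round to a fixed number of decimals, collapse -0.0 into 0.0,
        // then format independently of the current culture.
        public static string Number(double d, int digits = 9)
        {
            d = Math.Round(d, digits);
            if (d == 0.0) d = 0.0; // the test is true for -0.0 too, so both become +0.0
            return d.ToString("R", CultureInfo.InvariantCulture);
        }

        // ISO 8601 round-trip format in UTC.
        // Assumption: values with DateTimeKind.Unspecified are meant as local time.
        public static string Timestamp(DateTime t)
        {
            return t.ToUniversalTime().ToString("O", CultureInfo.InvariantCulture);
        }

        // Unicode-normalize; upper-case only when equality should ignore case.
        public static string Text(string s, bool ignoreCase = false)
        {
            if (s == null) return "\u0000"; // marker assumed never to occur as a real value
            s = s.Normalize();
            return ignoreCase ? s.ToUpperInvariant() : s;
        }

        // Sort with an ordinal comparer so the result depends on neither insertion
        // order nor culture; length prefixes keep element boundaries unambiguous.
        public static string Set(IEnumerable<string> items)
        {
            return string.Concat(items
                .OrderBy(x => x, StringComparer.Ordinal)
                .Select(x => x.Length.ToString(CultureInfo.InvariantCulture) + ":" + x));
        }
    }
    ```

    With these helpers, `Canonical.Number(1.000000000000001)` and `Canonical.Number(1.0)` produce the same text, and `Canonical.Set` gives the same result for the same elements in any order. Whether rounding to 9 digits matches your notion of equality is exactly the kind of decision the list above asks you to make per type.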

    It's up to you whether you use an existing serializer or do it yourself.
    Doing it yourself is more work and error prone, but you have full control over all aspects of equality and serialization.
    Using an existing serializer is also error prone, because you need to test or prove whether the results are always like you want.


    Introducing an unambiguous order and using a tree
    If you have control over the source code, you can introduce a custom order function.
    The order must take all properties, sub objects, lists, etc. into account. Then you can create a binary tree, and use the order to insert and lookup objects.

    The same problems as mentioned for the first approach still apply: you need to make sure that equal values are detected as such. The big-O performance is also worse than with hashing. But in most real-life cases, the actual performance should be comparable or at least fast enough.

    The good thing is that you can stop comparing two objects as soon as you find a property or value that is not equal, so there is no need to always look at the whole object. A balanced binary tree needs O(log2(n)) comparisons for a lookup, so that would be quite fast.

    The bad thing is that you need access to all the actual objects, so you have to keep them in memory. A hashtable needs only O(1) comparisons for a lookup on average, so it would be even faster (theoretically at least).
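
    A minimal sketch of the ordering approach, assuming a hypothetical `Query` type with a name and a list of ids (neither is from the question). The comparer returns at the first difference, and .NET's `SortedDictionary<TKey, TValue>`, which is backed by a binary search tree, serves as the lookup structure.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical key type standing in for one of the objects being cached for.
    public sealed class Query
    {
        public string Name { get; set; }
        public List<int> Ids { get; set; } = new List<int>();
    }

    public sealed class QueryComparer : IComparer<Query>
    {
        public int Compare(Query a, Query b)
        {
            // Compare member by member and return at the first difference.
            int c = string.CompareOrdinal(a.Name, b.Name);
            if (c != 0) return c;

            c = a.Ids.Count.CompareTo(b.Ids.Count);
            if (c != 0) return c;

            for (int i = 0; i < a.Ids.Count; i++)
            {
                c = a.Ids[i].CompareTo(b.Ids[i]);
                if (c != 0) return c;
            }
            return 0; // every compared member matched
        }
    }
    ```

    A cache could then be declared as `new SortedDictionary<Query, string>(new QueryComparer())` and queried with `TryGetValue`, passing a freshly built `Query` with the same contents. Note that the keys must not be mutated while they are in the tree, and that this sketch treats the id list as ordered; an unordered collection would have to be sorted first.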


    Put them in a database
    If you store all your objects in a database, then the database can do the lookup for you.
    Databases are quite good at comparing objects, and they have built-in mechanisms to handle the equality/near-equality problem.

    I'm not a database expert, so for this option, someone else might have more insight on how good this solution is.