Search code examples
pythonhashpython-dataclasses

Include Type in a Dataclass Hash Function


I have a hierarchy of frozen dataclasses, and I need to implement hashing for these dataclasses such that the hashes of every unique instance across the whole hierarchy are unique. I am defining "unique" here to mean that either the fields or the type of two instances differ. However, since the default dataclass __hash__ is a function only of the dataclass fields and not of the type, instances of different dataclass types which share the same fields hash to the same value by default.

Simplified Example

Below is a 3-level nested dataclass.

from dataclasses import dataclass
import abc

@dataclass(frozen=True)
class Element(abc.ABC):
    pass

@dataclass(frozen=True)
class StepType(Element, abc.ABC):
    @classmethod
    def name(cls):
        return cls.__name__

class Skip(StepType): pass
class Hop(StepType): pass

@dataclass(frozen=True)
class Stepper(Element, abc.ABC):
    step_type: StepType = Skip()
    foo: int = 1
    @abc.abstractmethod
    def step(self):
        pass

@dataclass(frozen=True)
class Single(Stepper):
    def step(self):
        return self.step_type.name() + " once"

@dataclass(frozen=True)
class Double(Stepper):
    def step(self):
        return self.step_type.name() + " twice"

@dataclass(frozen=True)
class Speed(Element, abc.ABC):
    @abc.abstractmethod
    def how_fast(self):
        pass

@dataclass(frozen=True)
class Slow(Speed):
    def how_fast(self):
        return "real slow"

@dataclass(frozen=True)
class Fast(Speed):
    def how_fast(self):
        return "quickly"

@dataclass(frozen=True)
class Walker:
    speed: Speed = Slow()
    stepper: Stepper = Single()

    def walk(self):
        return " ".join([self.stepper.step(), self.speed.how_fast()])

Inheritance Structure

Here's summary of the inheritance structure of Element.

Element
|- StepType
|  |- Skip
|  |- Hop
|
|- Stepper
|  |- Single
|  |- Double
|
|- Speed
   |- Slow
   |- Fast

Nesting Structure

Here's summary of the nested fields structure of Walker.

Walker
|- stepper: Stepper
|  |- step_type: StepType
| 
|- speed: Speed

Tested Behavior

Different instances of Walker and its Element instances hash to the same value.

a = Walker(speed=Slow(), stepper=Single())
b = Walker(speed=Fast(), stepper=Double(step_type=Hop()))
print(f"a: {hash(a)}\nb: {hash(b)}")
>> a: -5704360693866892300
   b: -5704360693866892300

Desired Behavior

I would like the hashes of unique Walker instances to be reliably different and maintain functionality for various dataclass features.

Constraints

  • This need not be done via overriding __hash__, it could be via creating new methods in Walker and/or Element. But it must be recursive since Element instances can contain other Elements to arbitrary depths. Ideally, I'd like to leverage the reliable default dataclass __hash__ rather than overriding it completely.
  • The hashes will be saved to and read from disk, so they must be consistent across runs. E.g., they can't depend on id(self) like object.__hash__ does.
  • Everything except Walker must inherit from Element. I cannot refactor any of the dataclasses into regular classes, Enums, etc. I must use the dataclass framework.

Ideas

I think it would work well to add repr(type(self)) to the hash function as well as the fields.

Idea 1

My first idea was something like:

class Element:
    def hash(self):
        return hash(self) ^ hash(repr(type(self)))

But this doesn't recurse down to nested Elements.

Idea 2

This is the best idea I have at the time of posting.

class Element:
    def __hash__(self):
        return hash(
            hash(repr(type(self))) ^ 
            (hash(
                hash((key, val)) for key, val in self.__dict__.items()
                )**3
            )
        )

This does recurse, and it passes my unit tests, but it overrides the built-in dataclass __hash__. I'm worried this could create some broken corner cases. Any input on a more robust method, pointing out where this would break, or a simple thumbs up is appreciated.


Solution

  • One approach to reusing dataclass' built-in hash function while adding the current class to the hash is to define a hidden field (hidden with init=False and repr=False) with the current class as the default value. This can be done by setting the field in the __init_subclass__ method of the base class, with the accompanying annotation of the hidden field set in the __annotations__ attribute:

    class Element:
        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            cls._type = field(init=False, repr=False, default=cls)
            cls.__annotations__['_type'] = type
    

    Demo here