Tags: python, immutability, python-dataclasses

Is it possible to prevent reading from a frozen python dataclass?


I have a situation where I would like to treat a frozen dataclass instance as always having the latest data. In other words, I'd like to detect when a dataclass instance has had replace called on it and raise an exception. It should also only apply to that particular instance, so that creating or replacing other dataclass instances of the same type does not affect it.

Here is some sample code:

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AlwaysFreshData:
    fresh_data: str


def attempt_to_read_stale_data():
    original = AlwaysFreshData(fresh_data="fresh")
    unaffected = AlwaysFreshData(fresh_data="not affected")

    print(original.fresh_data)

    new = replace(original, fresh_data="even fresher")

    print(original.fresh_data) # I want this to trigger an exception now

    print(new.fresh_data)

The idea here is to guard against both accidental mutation and stale reads of our dataclass objects, in order to prevent bugs.

Is it possible to do this? Either through a base class or some other method?

EDIT: The intention here is to have a way of enforcing/verifying "ownership" semantics for dataclasses, even if it is only during runtime.

Here is a concrete example of a situation with regular dataclasses that is problematic.

@dataclass
class MutableData:
    my_string: str

def sneaky_modify_data(data: MutableData) -> None:
    some_side_effect(data)
    data.my_string = "something else" # Sneaky string modification

x = MutableData(my_string="hello")

sneaky_modify_data(x)

assert x.my_string == "hello" # as a caller of 'sneaky_modify_data', I don't expect that x.my_string would have changed!

This can be prevented by using frozen dataclasses! But then there is still a situation that can lead to potential bugs, as demonstrated below.

@dataclass(frozen=True)
class FrozenData:
    my_string: str

def modify_frozen_data(data: FrozenData) -> FrozenData:
    some_side_effect(data)
    return replace(data, my_string="something else")

x = FrozenData(my_string="hello")

y = modify_frozen_data(x)

some_other_function(x) # AHH! I probably wanted to use y here instead, since it was modified!

In summary, I want the ability to prevent sneaky or unknown modifications to data, while also forcing invalidation of data that has been replaced. This prevents the ability to accidentally use data that is out-of-date.

This situation might be familiar to some as being similar to the ownership semantics in something like Rust.

As for my specific situation, I already have a large amount of code that uses these semantics, except with NamedTuple instances instead. That works because overriding the _replace function on an instance makes it possible to invalidate instances. The same strategy doesn't work as cleanly for dataclasses, since dataclasses.replace is not a method on the instances themselves.
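For context, the NamedTuple strategy described above can be sketched roughly like this. All names here (FreshTuple, _stale_ids) are illustrative, not from the original code, and the staleness registry is one possible mechanism; typing.NamedTuple forbids overriding _replace, so this subclasses a collections.namedtuple base instead:

```python
from collections import namedtuple

# Registry of replaced instances; illustrative only. Note that id() values
# can be reused after an instance is garbage-collected, so a real
# implementation would need something sturdier.
_stale_ids = set()

# typing.NamedTuple prohibits overriding _replace, so use a plain
# collections.namedtuple base class.
_FreshBase = namedtuple('_FreshBase', ['fresh_data'])


class FreshTuple(_FreshBase):
    __slots__ = ()

    def _replace(self, **kwargs):
        new = super()._replace(**kwargs)
        _stale_ids.add(id(self))  # invalidate the donor instance
        return new

    def __getattribute__(self, name):
        # Only guard the named fields; internal helpers like _fields
        # and _make must stay reachable for _replace to work.
        if name in _FreshBase._fields and id(self) in _stale_ids:
            raise RuntimeError('Instance went stale!')
        return super().__getattribute__(name)
```

One limitation of this sketch: positional access such as stale[0] bypasses __getattribute__ entirely, since tuple indexing does not go through attribute lookup.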


Solution

  • I'd agree with Jon that keeping a proper inventory of your data and updating shared instances would be a better way to go about the problem. But if that isn't possible or feasible for some reason (one you should seriously examine to decide whether this is really important enough), there is a way to achieve what you described (good mockup, by the way). It will require a little non-trivial code though, and there are some constraints on your dataclass afterwards:

    from dataclasses import dataclass, replace, field
    from typing import Any, ClassVar
    
    
    @dataclass(frozen=True)
    class AlwaysFreshData:
        #: sentinel that is used to mark stale instances
        STALE: ClassVar = object()
    
        fresh_data: str
        #: private staleness indicator for this instance
        _freshness: Any = field(default=None, repr=False)
    
        def __post_init__(self):
            """Updates a donor instance to be stale now."""
    
            if self._freshness is None:
                # is a fresh instance
                pass
            elif self._freshness is self.STALE:
                # this case probably leads to inconsistent data, maybe raise an error?
                print(f'Warning: Building new {type(self)} instance from stale data - '
                      f'is that really what you want?')
            elif isinstance(self._freshness, type(self)):
                # is a fresh instance from an older, now stale instance
                object.__setattr__(self._freshness, '_freshness', self.STALE)
            else:
                raise ValueError("Don't mess with private attributes!")
            object.__setattr__(self, '_freshness', self)
    
        def __getattribute__(self, name):
            # look STALE up on the class to avoid recursing into this method
            if object.__getattribute__(self, '_freshness') is type(self).STALE:
                raise RuntimeError('Instance went stale!')
            return object.__getattribute__(self, name)
    

    Which will behave like this for your test code:

    # basic functionality
    >>> original = AlwaysFreshData(fresh_data="fresh")
    >>> original.fresh_data
    'fresh'
    >>> new = replace(original, fresh_data="even fresher")
    >>> new.fresh_data
    'even fresher'
    
    # if fresher data was used, the old instance is "disabled"
    >>> original.fresh_data
    Traceback (most recent call last):
      File [...] in __getattribute__
        raise RuntimeError('Instance went stale!')
    RuntimeError: Instance went stale!
    
    # defining a new, unrelated instance doesn't mess with existing ones
    >>> runner_up = AlwaysFreshData(fresh_data="different freshness")
    >>> runner_up.fresh_data
    'different freshness'
    >>> new.fresh_data  # still fresh
    'even fresher'
    >>> original.fresh_data  # still stale
    Traceback (most recent call last):
      File [...] in __getattribute__
        raise RuntimeError('Instance went stale!')
    RuntimeError: Instance went stale!
    

    One important thing to note is that this approach introduces a new field to the dataclass, namely _freshness, which can be set by hand to mess up the whole logic. You can try to catch tampering in __post_init__, but something like this is still a valid sneaky way to keep an old instance fresh:

    >>> original = AlwaysFreshData(fresh_data="fresh")
    # calling replace with _freshness=None is a no-no, but we can't prohibit it
    >>> new = replace(original, fresh_data="even fresher", _freshness=None)
    >>> original.fresh_data
    'fresh'
    >>> new.fresh_data
    'even fresher'
    

    Additionally, we need a default value for it, which means that any fields declared below it also need a default value (which isn't too bad - just declare those fields above it), including all fields from future children (this is more of a problem, and there is a huge post on how to handle such a scenario).
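To make that ordering constraint concrete, here is a minimal illustration (not part of the answer's pattern itself, class names are made up) of what happens when a non-default field is declared below _freshness:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Works:
    fresh_data: str  # non-default fields go above _freshness
    _freshness: Any = field(default=None, repr=False)


# Declaring a non-default field below _freshness fails at class
# definition time, because the generated __init__ would have a
# non-default parameter after a default one:
try:
    @dataclass(frozen=True)
    class Broken:
        _freshness: Any = field(default=None, repr=False)
        fresh_data: str
except TypeError as exc:
    print(exc)  # e.g. non-default argument 'fresh_data' follows default argument
```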

    You also need a sentinel value available whenever you use this kind of pattern. This is not really bad, but it might be a strange concept to some people.
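The sentinel idea itself is small: a bare object() compares identical only to itself, so it can never collide with real field data. A minimal sketch of the concept (is_stale is just an illustrative helper, not part of the pattern above):

```python
STALE = object()  # unique marker; no user-supplied value can be `is STALE`


def is_stale(value):
    # Identity comparison distinguishes the sentinel from any real data,
    # including falsy values like None, 0, or "" that an equality check
    # might conflate.
    return value is STALE
```

This is the same trick the answer's STALE class variable relies on, which is why the check in __getattribute__ uses `is` rather than `==`.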