I'm trying to write a parser and I'm missing something about how dataclasses work.
I'm trying to be as generic as possible and keep the logic in the parent class, but every child instance ends up with the same values.
I'm confused about what the dataclass decorator does with class variables and instance variables.
I should probably not use self.__dict__ in my __post_init__.
How would you get unique instances while keeping the same idea?
from dataclasses import dataclass

class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None

@dataclass
class RecordParser():
    line: str

    def __post_init__(self):
        for k, var in self.__dict__.items():
            if isinstance(var, VarSlice):
                self.__dict__[k].value = self.line[var.slice]

@dataclass
class HeaderRecord(RecordParser):
    sender : VarSlice = VarSlice(3, 8)

k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)
Result :
45678
45678
Expected result is :
defgh
45678
I tried making VarSlice a dataclass too, but that changed nothing.
This behavior occurs because when you write:
sender: VarSlice = VarSlice(3, 8)
the default value is one specific VarSlice(3, 8) instance, created once when the class body runs and shared between all HeaderRecord instances.
Each call to __post_init__ then overwrites value on that same object, which is why both k and kk report the value parsed from the last line.
You can confirm this by printing the id of the VarSlice object inside __post_init__. If the id is the same each time you construct an instance of a RecordParser subclass, the object is being shared:
if isinstance(var, VarSlice):
    print(id(var))
    ...
This is very likely not what you want.
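To make the sharing concrete, here is a minimal sketch that checks object identity directly instead of printing ids. It drops the parent class and __post_init__ to keep the example short; that simplification is mine and not part of the original code.

```python
from dataclasses import dataclass


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class HeaderRecord:
    line: str
    # A plain default is evaluated once, when the class body runs.
    sender: VarSlice = VarSlice(3, 8)


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")

# Both instances reference the very same VarSlice object,
# which is also kept as a class attribute.
print(k.sender is kk.sender)            # True
print(k.sender is HeaderRecord.sender)  # True
```

Any mutation made through one instance is therefore visible through every other instance.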
The desired behavior is most likely to create a new VarSlice(3, 8) instance each time a new HeaderRecord object is instantiated.
To resolve the issue, I suggest using default_factory instead of default, as this is the recommended (and documented) approach for fields with mutable default values, i.e.:
sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))
instead of:
sender: VarSlice = VarSlice(3, 8)
The latter is technically equivalent to:
sender: VarSlice = field(default=VarSlice(3, 8))
Full code with example:
from dataclasses import dataclass, field

class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None

@dataclass
class RecordParser:
    line: str

    def __post_init__(self):
        for var in self.__dict__.values():
            if isinstance(var, VarSlice):
                var.value = self.line[var.slice]

@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))

k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)
Now prints:
defgh
45678
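If a record has many VarSlice fields, repeating the lambda gets noisy. One possible way to trim it is a small wrapper around field(). Note that var_slice and the receiver field below are hypothetical names introduced only for this sketch; they are not part of dataclasses or of the original code.

```python
from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


def var_slice(start, end):
    """Hypothetical helper: a field that builds a fresh VarSlice per instance."""
    return field(default_factory=lambda: VarSlice(start, end))


@dataclass
class RecordParser:
    line: str

    def __post_init__(self):
        for var in self.__dict__.values():
            if isinstance(var, VarSlice):
                var.value = self.line[var.slice]


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = var_slice(3, 8)
    receiver: VarSlice = var_slice(0, 3)  # hypothetical second field


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value, k.receiver.value)    # defgh abc
print(kk.sender.value, kk.receiver.value)  # 45678 123
print(k.sender is kk.sender)               # False
```

Each instance still gets its own VarSlice objects, because the factory runs once per __init__ call.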
Though this is clearly not a bottleneck, there is room for improvement when creating many instances of a RecordParser subclass.
Reasons that performance could be (slightly) impacted:
- A for loop runs on each instantiation to find the dataclass fields of type VarSlice, where the loop could potentially be avoided.
- The __dict__ attribute on the instance is accessed each time, which can also be avoided. Note that using dataclasses.fields() instead is actually worse, as this value is not cached on a per-class basis.
- An isinstance check is run on each dataclass field, each time a subclass is instantiated.
To resolve this, I suggest statically generating a __post_init__() method for the subclass via dataclasses._create_fn() (or copying its logic to avoid depending on an "internal" function), and setting it on the subclass before the @dataclass decorator runs for that subclass.
An easy way to do this is the __init_subclass__() hook, which runs when a class is subclassed, as shown below.
# to test when annotations are forward-declared (i.e. as strings)
# from __future__ import annotations
from collections import deque
from dataclasses import dataclass, field, _create_fn

class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None

@dataclass
class RecordParser:
    line: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # list containing the (dynamically-generated) body lines of `__post_init__()`
        post_init_lines = deque()
        # loop over class annotations (this is a greatly "simplified"
        # version of how the `dataclasses` module does it)
        for name, tp in cls.__annotations__.items():
            if tp is VarSlice or (isinstance(tp, str) and tp == VarSlice.__name__):
                post_init_lines.append(f'var = self.{name}')
                post_init_lines.append('var.value = line[var.slice]')
        # if there are no dataclass fields of type `VarSlice`, we are done
        if post_init_lines:
            post_init_lines.appendleft('line = self.line')
            cls.__post_init__ = _create_fn('__post_init__', ('self', ), post_init_lines)

@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))

k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)
print(kk.sender.value)
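Since _create_fn is a private helper, it can change or disappear between Python releases. Below is a rough sketch of the "copy the logic" variant mentioned above, which builds the same __post_init__ source and compiles it with plain exec(). The way the source text is assembled here is my own simplification, not how the dataclasses module does it internally.

```python
from dataclasses import dataclass, field


class VarSlice:
    def __init__(self, start, end):
        self.slice = slice(start, end)
        self.value = None


@dataclass
class RecordParser:
    line: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # One pair of generated statements per VarSlice-annotated field.
        body = []
        for name, tp in cls.__annotations__.items():
            if tp is VarSlice or tp == VarSlice.__name__:
                body.append(f'    var = self.{name}')
                body.append('    var.value = line[var.slice]')
        if body:
            src = '\n'.join(
                ['def __post_init__(self):', '    line = self.line', *body]
            )
            namespace = {}
            # Compile the generated function once, at class-creation time.
            exec(src, {}, namespace)
            cls.__post_init__ = namespace['__post_init__']


@dataclass
class HeaderRecord(RecordParser):
    sender: VarSlice = field(default_factory=lambda: VarSlice(3, 8))


k = HeaderRecord(line="abcdefgh")
kk = HeaderRecord(line="123456789")
print(k.sender.value)   # defgh
print(kk.sender.value)  # 45678
```

Like the version above, this only inspects the annotations found on the subclass being created, so a deeper class hierarchy would need extra handling.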