So I am solving a lot of Advent of Code tasks these days, and I set myself the added challenge of writing the code following as many best practices as possible. In particular, this means using type hints, making the code as DRY as possible, and separating the data structures from the logical structures. However, I am running into a bit of a problem.
Essentially, let me lay out the parts of the code that certainly need to be written, and written only once. These are:

- `Data_i`, where `i` is an integer between 1 and 25.
- A method for parsing `Data_i` from a file. Let's assume, for the sake of argument, that this method is `load_data_i`.
- `function_i_j`, where `i` is an integer between 1 and 25 and `j` is either 1 or 2. All functions return a string, and for each `i`, the function `function_i_j` accepts an instance of type `Data_i`.

Very basically, the code I could then write to handle a particular problem would be something like this:
```python
def solve(problem_number, task_number):
    g = globals()
    g[f'function_{problem_number}_{task_number}'](g[f'load_data_{problem_number}']())
```

However this, while quite DRY, is all sorts of hacky and ugly, and not really conducive to type hinting.
Some other ideas I had were:

- A `Solver` class with abstract methods `function_1` and `function_2`, and a method `solve` that just calls one of the two abstract methods. Then have 25 classes that inherit from `Solver`. The problem here is that each class inheriting from `Solver` will accept a different data type.
- A `Solver` class that also has `data` as part of each solver, but that violates separating data from logic.

I feel more at home in C++, where the above problem could be solved by making `function_i_j` a templated class, and then explicitly instantiating it for the 25 data types.
Now, my two questions:
Minimum example with only two data types:

```python
from pathlib import Path

Data1 = str
Data2 = float

def load_data_1(file_path: Path) -> Data1:
    with open(file_path) as f:
        return f.read()

def load_data_2(file_path: Path) -> Data2:
    with open(file_path) as f:
        return float(f.readline())

def function_1_1(data: Data1) -> str:
    return data.strip()

def function_1_2(data: Data1) -> str:
    return data.upper()

def function_2_1(data: Data2) -> str:
    return f'{data < 0}'

def function_2_2(data: Data2) -> str:
    return f'{data > 3.16}'

def main(problem_number: int, version_number: int) -> None:
    g = globals()
    function_to_call = g[f'function_{problem_number}_{version_number}']
    data_loader = g[f'load_data_{problem_number}']
    data_path = f'/path/to/data_{problem_number}.txt'
    print(function_to_call(data_loader(data_path)))
```
There is no equivalent to what you describe as a templated function in Python.
Getting an object by name dynamically (i.e. at runtime) will always make it impossible to infer its type for a static type checker. A type checker does not execute your code, it just reads it.
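To illustrate that limitation concretely (the function `greet` here is just a made-up stand-in, not part of the actual solution): once a callable is fetched from `globals()`, a static checker such as mypy only sees `Any`, so wrong arguments slip through unchecked.

```python
def greet(name: str) -> str:
    return f"hello {name}"

# Direct call: a type checker verifies the argument type.
direct = greet("world")

# Dynamic lookup: the checker infers `Any` from here on.
fn = globals()["greet"]
dynamic = fn("world")  # works at runtime, but unchecked
oops = fn(42)          # a checker cannot flag this mistake
print(direct, dynamic, oops)
```

Note that `fn(42)` still "works" at runtime only because the f-string happily formats any object; the checker simply had no chance to warn us.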
There are a few patterns and workarounds available to achieve code that more or less satisfies your constraints.
Here is how I understand the problem.
We are given N distinct data schemas (with N ≈ 25). Each of those schemas should be represented by its own data class. These will be our data types.
There should be a distinct function for each of our data classes that loads a file and parses its contents into an instance of that data class. We'll refer to them as our load functions. The logic for each of those load functions is given; they should all accept a file path and return an instance of their corresponding data class. There will consequently be N load functions.
For each data type, we are given M distinct algorithms (with M ≈ 2). Each of these algorithms shall have its own function that takes an instance of its corresponding class and returns a string. We'll call them our solver functions. Thus, there will be a total of N × M solver functions.
Each data type will be encoded with an integer `i` between 1 and N, which we will call the problem number. Each solver function for a given data type (i.e. for a given problem number) will be encoded with an integer `j` between 1 and M, which we will call the version number.

We are given N different files of data, each corresponding to a different data type. All the files reside in the same directory and are named `data_i.txt`, where `i` stands for its corresponding problem number.

The input to our main program will be two integers `i` and `j`.

The task is to load the `i`-th data file from disk, parse it into the corresponding data class's instance via its matching load function, call the `j`-th solver function defined for that data type on that instance, and print its output.
Where any of these constraints stand in conflict with one another, we should strive for a reasonable balance between them.
Three files in one package (+ `__init__.py`):

- `data.py` containing the data class definitions (and related code)
- `solver.py` containing the solver functions (and related code)
- `main.py` with the main function/script

I may reduce the number of blank lines/line breaks below what is typically suggested in style guides in the following to improve readability (reduce scrolling) on this site.
### The `data` module

Everything, literally everything (aside from keywords like `if` or `def`) in Python is an object and thus an instance of a class. Without further information we can assume that data of a certain schema can be encapsulated by an instance of a class. Python's standard library, for example, provides the `dataclasses` module that may be useful in such situations. Very good third-party libraries exist, too.
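As a quick sketch of what the `dataclasses` module buys us (the `Point` class here is just an illustrative stand-in, not one of our actual data classes):

```python
from dataclasses import dataclass

@dataclass
class Point:
    # The decorator generates __init__, __repr__, and __eq__ from
    # these annotated fields, so none of that boilerplate is written by hand.
    x: float
    y: float

p = Point(1.0, 2.0)
print(p)                     # Point(x=1.0, y=2.0)
print(p == Point(1.0, 2.0))  # True (field-wise comparison)
```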
To utilize the benefits that object-oriented programming provides, honor the DRY principle, and improve code reuse and type clarity, among other things, we can define one base data class that all our N data classes will inherit from.
Since the load function has an intimate 1:1 relationship with our data type, it is entirely reasonable to make it a method of our data classes. Since the logic is different for each individual data class, but each of them will have one, this is the perfect use case for abstract base classes (ABC) and the `abstractmethod` decorator provided by the `abc` module. We can define our base class as abstract and force any subclass to implement a `load` method, aside from its own data fields, of course.
`data.py`:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import TypeVar

__all__ = [
    "AbstractData",
    "Data1",
    "Data2",
    # ...
]

D = TypeVar("D", bound="AbstractData")


class AbstractData(ABC):
    @classmethod
    @abstractmethod
    def load(cls: type[D], file_path: Path) -> D: ...


@dataclass
class Data1(AbstractData):
    x: str

    @classmethod
    def load(cls, file_path: Path) -> Data1:
        with file_path.open("r") as f:
            return Data1(x=f.readline())


@dataclass
class Data2(AbstractData):
    y: float

    @classmethod
    def load(cls, file_path: Path) -> Data2:
        with file_path.open("r") as f:
            return Data2(y=float(f.readline()))

...
```
To be able to express that the type of the `Data1.load` class method is a subtype of `AbstractData.load`, we annotate the latter with a type variable in such a way that a type checker expects the output of that method to be of the specific type that it binds to (i.e. `cls`). That type variable further receives an upper bound of `AbstractData` to indicate that not just any type object is valid in this context, but only subtypes of `AbstractData`.
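Stripped of the file loading, the self-typed classmethod pattern looks like this in isolation (`Base`/`Sub` are throwaway names for illustration only):

```python
from typing import TypeVar

T = TypeVar("T", bound="Base")

class Base:
    @classmethod
    def create(cls: type[T]) -> T:
        # `cls` is the actual class this was called on, so a type
        # checker infers the result of `Sub.create()` as `Sub`,
        # not merely as `Base`.
        return cls()

class Sub(Base):
    pass

obj = Sub.create()
print(type(obj).__name__)  # Sub
```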
### The `solver` module

Introduce a base solver class and a subclass for each problem number. Regarding abstractness and inheritance, the same ideas apply.
The difference this time is that we can make the base solver class generic in terms of the data class it deals with. This allows us (with a few tricks) to minimize code, while maintaining type safety.
A solver will have an attribute that can hold a reference to an instance of its corresponding data class. When initializing a solver, we can provide the path to a data file to immediately load and parse the data and save an instance of its data class in that attribute of the solver. (And/Or we can load it later.)
We will write a `get_solver` function that takes the problem number as its argument and returns the corresponding solver class. It will still use the approach of fetching it from the `globals()` dictionary, but we will make this as type safe, runtime safe, and clean as possible (given the situation).

To have knowledge of the narrowest possible type, i.e. the concrete solver subclass returned by `get_solver`, we will have no choice but to use the `Literal` + `overload` pattern. And yes, that means N distinct signatures for the same function. (Notice the trade-off "DRY vs. type safe" here.)
`solver.py`:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Generic, Literal, TypeAlias, TypeVar
from typing import get_args, get_origin, overload

from .data import *

__all__ = [
    "AbstractBaseSolver",
    "Solver1",
    "Solver2",
    "ProblemNumT",
    "VersionNumT",
    "get_solver",
]

D = TypeVar("D", bound=AbstractData)


class AbstractBaseSolver(ABC, Generic[D]):
    _data_type: type[D] | None = None  # narrowed in specified subclasses
    _data: D | None = None  # narrowed via instance property

    @classmethod
    def __init_subclass__(cls, **kwargs: object) -> None:
        """
        Initializes a subclass and narrows the `_data_type` attribute on it.

        It does this by identifying this specified class among all original
        base classes and extracting the provided type argument.
        Details: https://stackoverflow.com/questions/73746553/
        """
        super().__init_subclass__(**kwargs)
        for base in cls.__orig_bases__:  # type: ignore[attr-defined]
            origin = get_origin(base)
            if origin is None or not issubclass(origin, AbstractBaseSolver):
                continue
            type_arg = get_args(base)[0]
            # Do not set the attribute for GENERIC subclasses!
            if not isinstance(type_arg, TypeVar):
                cls._data_type = type_arg
                return

    @classmethod
    def get_data_type(cls) -> type[D]:
        if cls._data_type is None:
            raise AttributeError(
                f"{cls.__name__} is generic; type argument unspecified"
            )
        return cls._data_type

    def __init__(self, data_file_path: Path | None = None) -> None:
        if data_file_path is not None:
            self.load_data(data_file_path)

    def load_data(self, file_path: Path) -> None:
        self._data = self.get_data_type().load(file_path)

    @property
    def data(self) -> D:
        if self._data is None:
            raise AttributeError("No data loaded yet")
        return self._data

    @abstractmethod
    def function_1(self) -> str:
        ...

    @abstractmethod
    def function_2(self) -> str:
        ...


class Solver1(AbstractBaseSolver[Data1]):
    def function_1(self) -> str:
        return self.data.x.strip()

    def function_2(self) -> str:
        return self.data.x.upper()


class Solver2(AbstractBaseSolver[Data2]):
    def function_1(self) -> str:
        return str(self.data.y ** 2)

    def function_2(self) -> str:
        return self.data.y.hex()


ProblemNumT: TypeAlias = Literal[1, 2]
VersionNumT: TypeAlias = Literal[1, 2]


@overload
def get_solver(problem_number: Literal[1]) -> type[Solver1]:
    ...

@overload
def get_solver(problem_number: Literal[2]) -> type[Solver2]:
    ...

def get_solver(problem_number: ProblemNumT) -> type[AbstractBaseSolver[D]]:
    cls_name = f"Solver{problem_number}"
    try:
        cls = globals()[cls_name]
    except KeyError:
        raise NameError(f"`{cls_name}` class not found") from None
    assert isinstance(cls, type) and issubclass(cls, AbstractBaseSolver)
    return cls
```
That whole `__init_subclass__`/`get_data_type` hack is something I explain in more detail here. It allows utilizing the (specific) type argument passed to `__class_getitem__`, when we subclass `AbstractBaseSolver`, at runtime. This allows us to write the code for instantiating, loading, and accessing the data class instance only once, yet remain entirely type safe with it across all subclasses. The idea is to only write the `function_1`/`function_2` methods on each subclass after specifying the type argument, and nothing else.
The code inside the `function_1`/`function_2` methods is obviously just for demo purposes, but it again illustrates type safety across the board quite nicely.
To be perfectly clear, the `ProblemNumT` type alias will need to be expanded to the number of problems/data types, i.e. `Literal[1, 2, 3, 4, 5, ...]`. The call signature for `get_solver` will likewise need to be written out N times. If anyone has a better idea than repeating the `overload`ed signature 25 times, I am eager to hear it, as long as the annotations remain type safe.
The actual implementation of `get_solver` is cautious with the dictionary lookup and transforms the error a bit to keep it in line with typical Python behavior when a name is not found. The last `assert` is for the benefit of the static type checker, to convince it that what we are returning is as advertised, but it is likewise an assurance for us at runtime that we did not mess up along the way.
### The `main` module

Not much to say here. Assuming two function versions for each solver/data type, the `if` statements are totally fine. If that number increases, well ... you get the idea. What is nice is that we know exactly which solver we get, depending on the integer we pass to `get_solver`. All the rest is also safe and pretty much self-explanatory:
`main.py`:

```python
from pathlib import Path

from .solver import ProblemNumT, VersionNumT, get_solver

DATA_DIR_PATH = Path(__file__).parent


def main(problem_number: ProblemNumT, version_number: VersionNumT) -> None:
    solver_cls = get_solver(problem_number)
    data_file_path = Path(DATA_DIR_PATH, f"data_{problem_number}.txt")
    solver = solver_cls(data_file_path)
    if version_number == 1:
        print(solver.function_1())
    elif version_number == 2:
        print(solver.function_2())
    else:
        raise ValueError("Version number must be 1 or 2")


if __name__ == "__main__":
    main(1, 2)
    main(2, 1)
```
If we put a `data_1.txt` with `foo` in its first line and a `data_2.txt` with `2.0` in its first line into the package's directory and run the script with `python -m package_name.main`, the output will be as expected:

```
FOO
4.0
```

There are no complaints from `mypy --strict` about that package.
This is the best I could come up with after the little back and forth in the comments. If this illustrates a grave misunderstanding, feel free to point it out. It still seems to me that your question is very broad and allows a lot of room for interpretation, which makes pedants like me uncomfortable. I don't consider myself an expert, but I hope this still illustrates a few patterns and tricks that Python offers when trying to write clean code.