I have a data class as follows:
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict

raw_dir = r"C:..."  # path of the raw dir
processed_dir = r"C:..."  # path of the processed dir

@dataclass
class Files:
    raw_path: Path = Path(raw_dir)
    processed_path: Path = Path(processed_dir)
    path_dict: Dict[str, Any] = field(
        default_factory=lambda: {
            "raw_train_file": Path(raw_path, "raw_train.csv"),
            "processed_train_file": Path(processed_path, "processed_train.csv"),
        }
    )

Files().path_dict
This raises NameError: name 'raw_path' is not defined.
However, printing raw_path right after the line that defines it works, so the problem seems to come from path_dict. I tried replacing the key-value pair with "raw": Path(directory) and it worked, so I do not think the data type is the issue.
Context: I treat the dataclass as a config file (func), so that when I need a default path I can just use:
pd.read_csv(Files().path_dict["raw_train_file"])
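Your problem is that default_factory has to be a zero-argument callable: it is called without the instance, so it cannot use any member variable. The name raw_path inside the lambda is not resolved against the class body either, because class scope is not visible from nested functions, so Python looks it up as a global and raises the NameError. Here, as the member variables have trivial initialization, you can repeat that initialization and use only global variables: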
...
    path_dict: Dict[str, Any] = field(
        default_factory=lambda: {
            "raw_train_file": Path(Path(raw_dir), "raw_train.csv"),
            "processed_train_file": Path(Path(processed_dir), "processed_train.csv"),
        }
    )
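For reference, a minimal runnable sketch of this globals-only variant. The directories "data/raw" and "data/processed" are hypothetical placeholders standing in for the elided paths, and default_files and custom_files are names made up for illustration:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict

# Hypothetical placeholder directories; substitute your real paths.
raw_dir = "data/raw"
processed_dir = "data/processed"


@dataclass
class Files:
    raw_path: Path = Path(raw_dir)
    processed_path: Path = Path(processed_dir)
    # The lambda reads only module-level names, which any function can see.
    path_dict: Dict[str, Any] = field(
        default_factory=lambda: {
            "raw_train_file": Path(raw_dir, "raw_train.csv"),
            "processed_train_file": Path(processed_dir, "processed_train.csv"),
        }
    )


default_files = Files()
# Overriding raw_path does not affect path_dict: the factory never sees the instance.
custom_files = Files(raw_path=Path("elsewhere"))
```

One limitation to keep in mind: in this sketch custom_files.path_dict["raw_train_file"] still points into raw_dir, because the factory is built from the globals and not from the instance's raw_path.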
But you can also use the special __post_init__ method, which is called by the generated __init__ after the other fields have been initialized. As it receives the self argument, it can use member variables:
@dataclass
class Files:
    raw_path: Path = Path(raw_dir)
    processed_path: Path = Path(processed_dir)

    def __post_init__(self):
        self.path_dict: Dict[str, Any] = {
            "raw_train_file": Path(self.raw_path, "raw_train.csv"),
            "processed_train_file": Path(self.processed_path, "processed_train.csv"),
        }
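A possible refinement, sketched under assumptions: declaring path_dict with field(init=False) keeps it a real dataclass field (so it shows up in the repr and in comparisons) while still filling it in __post_init__. The directories "data/raw" and "data/processed" are hypothetical placeholders for the elided paths, and the variable name files is only for illustration:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict

# Hypothetical placeholder directories; substitute your real paths.
raw_dir = "data/raw"
processed_dir = "data/processed"


@dataclass
class Files:
    raw_path: Path = Path(raw_dir)
    processed_path: Path = Path(processed_dir)
    # init=False: not a constructor argument, but still a declared field.
    path_dict: Dict[str, Any] = field(init=False)

    def __post_init__(self) -> None:
        # self is available here, so the dict can be derived from the other fields.
        self.path_dict = {
            "raw_train_file": Path(self.raw_path, "raw_train.csv"),
            "processed_train_file": Path(self.processed_path, "processed_train.csv"),
        }


# Overriding raw_path is reflected in path_dict, since it is built per instance.
files = Files(raw_path=Path("other/raw"))
```

Because the dict is computed from self, Files(raw_path=...) yields paths under the overridden directory, which may suit the config-style usage pd.read_csv(Files().path_dict["raw_train_file"]) when defaults occasionally need to change.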