
Data Classes vs Dictionaries


I've been learning about dataclasses, and was reworking an old project trying to integrate a dataclass into the program in place of a dictionary system I was using. The code blocks below are essentially the respective new and old methods being used to build a dataframe of several thousand items. My problem is I don't understand the use-case for the dataclass over a dictionary.

What I want to know is:

  1. When should I use a dataclass over a dictionary (or vice versa)?

  2. Programmatically, in this instance of simply cataloguing data, is either method more efficient/optimized than the other?

  3. In actual practice, is either method encouraged over the other (for reasons of efficiency, readability, industry standards, or otherwise)?

Method using @dataclass

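from dataclasses import dataclass
from typing import Optional

import pandas as pd

@dataclass
class Car:
    # Fields default to None until they are filled in below.
    year: Optional[int] = None
    model: Optional[str] = None

def main():
    foo = {}
    # car_list, get_year and get_model are defined elsewhere in the project.
    for name in car_list:
        bar = Car()
        bar.year = get_year(name)
        bar.model = get_model(name)

        # vars() returns the instance's attribute dictionary.
        foo[name] = vars(bar)

    df = pd.DataFrame.from_dict(foo)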

Method using Dictionary

import pandas as pd

def main():
    foo = {}
    # car_list, get_year and get_model are defined elsewhere in the project.
    for name in car_list:

        bar = {
            'year': None,
            'model': None
        }

        bar['year'] = get_year(name)
        bar['model'] = get_model(name)

        foo[name] = bar

    df = pd.DataFrame.from_dict(foo)

Solution

  • As discussed in the comments, there is plenty of discussion (and opinion) about this particular comparison. After several hours of research, there are a few main points I'd like to lay out for anyone else who may have this question in the future.

    1. In regard to efficiency

    Dictionaries are simpler data containers and are therefore generally faster. Under the hood, instances of classes and dataclasses store their attributes in a dictionary by default, with a bit more going on around it. The top answer on this SO post provides insight into how much more efficient dictionaries are than dataclasses for various tasks (creating a container can be as much as 5x slower, whereas accessing the data is only about 1.2-1.25 times slower). Various other accounts on the web demonstrate similar results.
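
To check the efficiency claim on your own machine, a rough sketch using the standard-library timeit module might look like the following. The Car fields and the sample values are made up for illustration, and the ratios you get will depend on your Python version and hardware, so treat the output as indicative only.

```python
import timeit
from dataclasses import dataclass


@dataclass
class Car:
    year: int
    model: str


def make_dict():
    return {"year": 2020, "model": "Civic"}


def make_dataclass():
    return Car(year=2020, model="Civic")


# A default (non-slots) dataclass instance keeps its fields in a plain dict.
car = make_dataclass()
instance_dict = vars(car)

record = make_dict()
n = 100_000  # number of repetitions per measurement (arbitrary choice)

# Time container creation.
dict_create = timeit.timeit(make_dict, number=n)
dataclass_create = timeit.timeit(make_dataclass, number=n)

# Time reading a single value back out.
dict_access = timeit.timeit(lambda: record["year"], number=n)
dataclass_access = timeit.timeit(lambda: car.year, number=n)

print(f"create: dict {dict_create:.4f}s, dataclass {dataclass_create:.4f}s")
print(f"access: dict {dict_access:.4f}s, dataclass {dataclass_access:.4f}s")
```

For a few thousand items, as in the question, either container is likely to be dwarfed by whatever get_year and get_model do.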

    2. Functional Differences

    A major point aside from speed is control over mutability. It's not impossible to make the elements of a dictionary immutable, but it generally requires creating classes or functions, or importing some library. Dataclasses, on the other hand, allow instances to be frozen after creation by simply passing frozen=True to the dataclass decorator. Besides blocking changes to existing values, this also prevents any attributes from being added, accidentally or otherwise.
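
As a minimal sketch of that difference (the Car fields and values here are illustrative, not taken from the original project), a frozen dataclass raises FrozenInstanceError on both reassignment and on adding a new attribute, while the dictionary needs a wrapper such as types.MappingProxyType to get a read-only view:

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType


@dataclass(frozen=True)
class Car:
    year: int
    model: str


car = Car(year=2020, model="Civic")
blocked = []

# Reassigning an existing field is rejected.
try:
    car.year = 2021
except FrozenInstanceError:
    blocked.append("reassign field")

# Adding a brand-new attribute is rejected as well.
try:
    car.colour = "red"
except FrozenInstanceError:
    blocked.append("add attribute")

# A plain dict accepts both operations silently.
record = {"year": 2020, "model": "Civic"}
record["year"] = 2021
record["colour"] = "red"

# One standard-library way to get a read-only view of a dict.
read_only = MappingProxyType({"year": 2020, "model": "Civic"})
try:
    read_only["year"] = 2021
except TypeError:
    blocked.append("mapping proxy item assignment")

print(blocked)
```

Note that the proxy is only a view: anyone holding the underlying dict can still change it.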

    Other decorator arguments provide potentially even finer control over class creation. This video is an excellent, beginner-friendly resource that demonstrates several attributes and use cases of dataclasses.

    Type hints are another reason one might prefer dataclasses. While type hints can be used with dictionaries and their values, a dataclass declares a type for each field individually, which gives static checkers and editors more to work with. This is a great write-up on Medium about a team who refactored a project to use dataclasses instead of dictionaries.
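
To make the type-hint comparison concrete, here is a small sketch under the assumption of the same two hypothetical fields. It shows a TypedDict for the dictionary side and a dataclass for the object side. One caveat worth knowing: neither enforces the annotated types at runtime; the hints are there for static checkers such as mypy and for editor autocompletion. The dataclass does, however, reject a misspelled field name at construction time.

```python
from dataclasses import dataclass, fields
from typing import TypedDict


class CarDict(TypedDict):
    # Per-key types for a dictionary; checked by static tools only.
    year: int
    model: str


@dataclass
class Car:
    year: int
    model: str


typed_record: CarDict = {"year": 2020, "model": "Civic"}

# The declared fields of a dataclass can be inspected programmatically.
field_names = [f.name for f in fields(Car)]

# A misspelled field name fails immediately when building the object.
try:
    Car(yaer=2020, model="Civic")
    typo_rejected = False
except TypeError:
    typo_rejected = True

# Types are NOT validated at runtime: a string is accepted for year.
loosely_typed = Car(year="2020", model="Civic")

print(field_names, typo_rejected, repr(loosely_typed.year))
```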

    3. Which one is right for me?

    I've spent the better part of an afternoon learning why it's hard to find an answer to this question: because it depends. If one were objectively better than the other, the other would have been deprecated. Dictionaries are simpler: they require no imports, can be created, accessed and mutated with ease, and will produce faster results. Dataclasses, on the other hand, allow the user finer control. This can be especially important when working on large projects with several team members.

    A general heuristic when designing a program is to defer to the simplest structure when possible. In my particular case, I don't need the additional functionality of dataclasses when my goal is simply to create a dataframe and my input data is somewhat reliable. Using a dataclass doesn't noticeably slow my code down, but if I were to take on several more inputs, I might see performance take a hit.