What is data oriented design?


I was reading this article, and this guy goes on talking about how everyone can greatly benefit from mixing in data oriented design with OOP. He doesn't show any code samples, however.

I googled this and couldn't find any real information as to what this is, let alone any code samples. Is anyone familiar with this term and can provide an example? Is this maybe a different word for something else?


Solution

  • First of all, don't confuse this with data-driven design.

    My understanding of Data-Oriented Design (DOD) is that it is about organizing your data for efficient processing, especially with respect to cache misses. Data-driven design, on the other hand, is about letting data control much of your program's behavior (described very well in Andrew Keith's answer).

    Say you have ball objects in your application with properties such as color, radius, bounciness, position, etc.

    Object Oriented Approach

    In OOP you would describe balls like this:

    class Ball {
        Point  position;
        Color  color;
        double radius;
    
        void draw();
    };
    

    And then you would create a collection of balls like this:

    vector<Ball> balls;
    

    Data-Oriented Approach

    In Data Oriented Design, however, you are more likely to write the code like this:

    class Balls {
        vector<Point>  position;
        vector<Color>  color;
        vector<double> radius;
    
        void draw();
    };
    

    As you can see there is no single unit representing one Ball anymore. Ball objects only exist implicitly.

    This can have several performance advantages. First, we usually want to operate on many balls at once, and the hardware works most efficiently on large contiguous chunks of memory.

    Second, you might do operations that touch only some of a ball's properties. For example, if you combine the colors of all the balls in various ways, you want your cache to contain only color information, which DOD allows. When all ball properties are stored in one unit, as in OOP, you pull in all the other properties of each ball as well, even though you don't need them.
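    As a rough sketch of the color-only case, the loop below walks just the color array and never touches positions or radii. The `Point` and `Color` definitions, the `add` helper, and the `darken` function are assumptions made up for illustration; the original only names the member arrays.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal types; the original does not define Point or Color.
struct Point { float x, y; };
struct Color { std::uint8_t r, g, b; };

struct Balls {
    std::vector<Point>  position;
    std::vector<Color>  color;
    std::vector<double> radius;

    // Keep the parallel arrays the same length: index i is "ball i".
    void add(Point p, Color c, double r) {
        position.push_back(p);
        color.push_back(c);
        radius.push_back(r);
    }

    std::size_t size() const { return color.size(); }
};

// Halve the brightness of every ball. Only the contiguous color array is
// read and written, so no position or radius data is pulled into the cache.
void darken(Balls& balls) {
    for (Color& c : balls.color) {
        c.r = static_cast<std::uint8_t>(c.r / 2);
        c.g = static_cast<std::uint8_t>(c.g / 2);
        c.b = static_cast<std::uint8_t>(c.b / 2);
    }
}
```

    With a `vector<Ball>` layout, the same loop would step over each ball's position and radius on its way to the next color.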

    Cache Usage Example

    Say each ball takes up 64 bytes and a Point takes 4 bytes. A cache slot takes, say, 64 bytes as well. If I want to update the position of 10 balls, then:

    • In OOP, I have to pull in 10 x 64 = 640 bytes of memory into cache and get 10 cache misses.
    • In DOD, however, I can pull in the positions of the balls separately (without pulling in each ball's other properties), which takes only 10 x 4 = 40 bytes. That fits in one cache fetch, so we get only 1 cache miss to update all 10 balls.

    These numbers are arbitrary, and real object and cache line sizes vary by hardware (64-byte lines are common on current x86 CPUs).

    Still, they illustrate how memory layout can have a severe effect on cache hits and thus performance. This will only increase in importance as the difference between CPU and RAM speed widens.
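    The arithmetic in the example can be checked with a small helper that counts how many distinct cache lines a pass over one field touches. This is a simplified model, not a measurement: `lines_touched` is a hypothetical function, and it assumes the array starts on a cache line boundary, elements are visited in order, and `field_size` is at least 1.

```cpp
#include <cstddef>

// Count the distinct cache lines touched when reading one field of
// `field_size` bytes from each of `count` elements spaced `stride` bytes
// apart. Simplified model: the first element starts on a line boundary.
std::size_t lines_touched(std::size_t count, std::size_t stride,
                          std::size_t field_size, std::size_t line_size) {
    std::size_t touched = 0;
    std::size_t next_unseen = 0;  // first cache line index not yet counted
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t first = (i * stride) / line_size;
        std::size_t last  = (i * stride + field_size - 1) / line_size;
        if (first < next_unseen) first = next_unseen;  // skip lines already counted
        if (last >= first) {
            touched += last - first + 1;
            next_unseen = last + 1;
        }
    }
    return touched;
}
```

    With the numbers from the example, `lines_touched(10, 64, 4, 64)` gives 10 for the one-object-per-ball layout (a 4-byte position inside each 64-byte ball), and `lines_touched(10, 4, 4, 64)` gives 1 for the separate position array.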

    How to lay out the memory

    In my ball example I simplified the issue a lot, because in a typical app you will often access several variables together. For example, position and radius will probably be used together frequently. In that case your DOD structure should be:

    class Body {
        Point  position;
        double radius;
    };
    
    class Balls {
        vector<Body>  bodies;
        vector<Color>  color;
    
        void draw();
    };
    

    The reason you should do this is that if data used together are placed in separate arrays, there is a risk that they will compete for the same slots in the cache. Thus loading one will throw out the other.
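    A minimal sketch of an operation that benefits from this grouping: the check below needs position and radius together for every ball, so both arrive in the same cache line, and the color array is never touched. The `Point` and `Color` definitions and the `count_touching_floor` function are hypothetical, added only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal types; the original does not define Point or Color.
struct Point { double x, y; };
struct Color { unsigned char r, g, b; };

// Position and radius are used together, so they are stored together.
struct Body {
    Point  position;
    double radius;
};

struct Balls {
    std::vector<Body>  bodies;
    std::vector<Color> color;  // rarely needed alongside bodies, kept separate
};

// Count balls whose lowest point is at or below the floor height.
// Reads only the bodies array; each iteration uses position and radius.
std::size_t count_touching_floor(const Balls& balls, double floor_y) {
    std::size_t n = 0;
    for (const Body& b : balls.bodies) {
        if (b.position.y - b.radius <= floor_y) {
            ++n;
        }
    }
    return n;
}
```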

    So compared to Object-Oriented programming, the classes you end up making are not related to the entities in your mental model of the problem. Since data is lumped together based on data usage, you won't always have sensible names to give your classes in Data-Oriented Design.

    Relation to relational databases

    The thinking behind Data-Oriented Design is very similar to how you think about relational databases. Optimizing a relational database can also involve using the cache more efficiently, although in this case, the cache is not CPU cache but pages in memory.

    • A good database designer will likely split infrequently accessed data out into a separate table, rather than creating one table with a huge number of columns of which only a few are ever used.
    • Or they might choose to denormalize some of the tables (maybe into a single table), so that data doesn't have to be fetched from multiple locations on disk.

    Just like with Data-Oriented Design these choices are made by looking at what the data access patterns are and where the performance bottleneck is.