Search code examples
c++optimizationmemorystructcache-control

C++ Cache friendly way of accessing all the members of all elements of a `vector <struct_type>`


I'm interested in optimizing my code for multithreaded computing. In terms of the cache, pipelining, or any other aspects of memory access, how do the following compare for conserving those resources:

Case 1

struct something{
    float a;
    float b;
    int c;
    bool d;
};

vector <something> vec(n, something());

for(int q=0; q<n; q++)
    {
         vec[q].a = expression1;
         vec[q].b = expression2;
         vec[q].c = expression3;
         vec[q].d = expression4;
    } 

Case 2

struct something{
    float a;
    float b;
    int c;
    bool d;
};

vector <something> vec(n, something());

for(int q=0; q<n; q++)
    vec[q].a = expression1;
for(int q=0; q<n; q++)
    vec[q].b = expression2;
for(int q=0; q<n; q++)
    vec[q].c = expression3;
for(int q=0; q<n; q++)
    vec[q].d = expression4;

Case 3

vector <float> a(n);
vector <float> b(n);
vector <int>   c(n);
vector <bool>  d(n); 

for(int q=0; q<n; q++)
    a[q] = expression1;
for(int q=0; q<n; q++)
    b[q] = expression2;
for(int q=0; q<n; q++)
    c[q] = expression3;
for(int q=0; q<n; q++)
    d[q] = expression4;

Also, are there better ways of approaching the above?


Solution

    • Case 1 is the most readable.
    • Case 1 and case 3 are equally cache friendly. Both make only one pass through all the data.*
    • Case 2 is the worst because it makes 4 passes over the data - each pass only touching one element.

    If all the struct fields are different, then case 3 has a huge advantage of possibly being vectorizable while case 1 doesn't.

    The reason for this is because case 3 is the struct of arrays packing that puts all the same datatypes together sequentially in memory - thereby exposing vectorization.

    EDIT :

    *Case 3 is potentially even more cache friendly than case 1 because it doesn't need struct-padding - so the data size is smaller.