Is there any way I can accelerate this function:
void task(int I, int J, int K, int *L, int **ids, double *bar){
double *foo[K];
for (int k=0;k<K;k++)
foo[k] = new double[I*L[k]];
// I am filling these arrays somehow
// This is not a bottleneck, hence omitted here
for (int i=0;i<I;i++)
for (int j=0;j<J;j++){
double tmp = 1.;
for (int k=0;k<K;k++)
tmp *= foo[k][i*L[k]+ids[j][k]]; //ids[j][k]<L[k]
bar[i*J+j] = tmp;
}
}
Typical values are: I = 100,000
, J = 10,000
, K=3
, L=[50,20,60]
.
I read that the __restrict__
keyword/extension could help, but am not sure how to apply it here. For example, trying to put it into the definition of foo[k] = new double[...]
I get error: '__restrict_ qualifiers cannot be applied to double
. Furthermore, I don't know whether I should / how I could declare ids
and ids[j], 1<= j<= J
as restricted.
As a note, in my actual code, I execute as such tasks in parallel in as many threads as my CPU has cores.
I am writing mostly C-compatible C++, so solutions in both languages are welcome.
https://en.cppreference.com/w/c/language/restrict claims you can declare an array of restrict
pointers to double like so in C99/C11:
typedef double *array_t[10];
restrict array_t foo; // the type of a is double *restrict[10]
But only gcc accepts that. I think this is a GCC-ism, not valid ISO C11. (gcc also accepts
array_t restrict foo_r;
but no other compilers accept that either.)
ICC warns "restrict" is not allowed
, clang rejects it with
<source>:16:5: error: restrict requires a pointer or reference ('array_t' (aka 'double *[10]') is invalid)
restrict array_t foo_r;
^
MSVC rejects it with error C2219: syntax error: type qualifier must be after '*'
We get essentially the same behaviour in C++ from these compilers with __restrict
, which they accept as a C++ extension with the same semantics as C99 restrict
.
As a workaround, you can instead use a qualified temporary pointer every time you read from foo
, instead of f[k][stuff]
. I think this promises that the memory you reference through fk
isn't the same memory you access through any other pointers within the block where fk
is declared.
double *__restrict fk = foo[k];
tmp *= fk[ stuff ];
I don't know how to promise the compiler that none of the f[0..K-1]
pointers alias each other. I don't think this accomplishes that.
You don't need __restrict here.
I added __restrict
to all the pointer declarations, like int *__restrict *__restrict ids
and it doesn't change the asm at all, according to a diff pane on the Godbolt compiler explorer: https://godbolt.org/z/4YjlDA. As we'd expect because type-based aliasing lets the compiler assume that a double
store into bar[]
doesn't modify any of the int *
elements of int *ids[]
. As people said in comments, there's no aliasing here that the compiler can't already sort out. And in practice it appears that it does sort it out, without any extra reloads of pointers.
It also can't alias *foo[k]
, because we got those pointers with new
inside this function. They can't be pointing inside bar[]
.
(All the major x86 C++ compilers (GCC,clang,ICC,MSVC) support __restrict
in C++ with the same behaviour as C99 restrict
: a promise to the compiler that stores through this pointer don't modify objects that are pointed to by another pointer.
I'd recommend __restrict
over __restrict__
, at least if you mostly want portability across x86 compilers. I'm not sure about outside of that.)
It looks like you're saying you tried to put __restrict__
into an assignment, not a declaration. That won't work, it's the pointer variable itself that __restrict
applies to, not a single assignment to it.
The first version of the question had a bug in the inner loop: it had K++
instead of k++
, so it was pure undefined behaviour and the compilers got weird. The asm didn't make any sense (e.g. no FP multiply instruction, even when foo[]
was a function arg). This is why it's a good idea to use a name like klen
instead of K
for an array dimension.
After fixing that on the Godbolt link, there's still no difference in the asm with / without __restrict
on everything, but it's a lot more sane.
BTW, making double *foo[]
a function arg would let us look at the asm for just the main loop. And you would actually need __restrict
because a store to bar[]
could modify an element of foo[][]
. This doesn't happen in your function because the compiler knows that new
memory isn't pointed-to by any existing pointers, but it wouldn't know that if foo
was a function arg.
There's a small amount of the work inside the loop is sign-extending 32-bit int
results before using them as array indices with 64-bit pointers. This adds a cycle of latency in there somewhere, but not the loop-carried FP multiply dependency chain so it may not matter. You can get rid of one instruction inside the inner loop on x86-64 by using size_t k=0;
as the inner-most loop counter. L[]
is a 32-bit array, so i*L[k]
needs to be sign-extended inside the loop. Zero-extension from 32 to 64-bit happens for free on x86-64, so i * (unsigned)L[k]
saves a movsx
instruction in the pointer-chasing dep chain. Then the inner loop that gcc8.2 makes is all necessary work, required by your nasty data structures / layout. https://godbolt.org/z/bzVSZ7
I don't know whether that's going to make a difference or not. I think more likely the memory access pattern causing cache misses will be your bottleneck with real data.
It also can't auto-vectorize because the data isn't contiguous. You can't get contiguous source data from looping over j
or i
, though. At least i
would be a simple stride without having to redo ids[j][k]
.
If you generate foo[k][...]
and bar[...]
transposed, so you index with foo[k][ i + L[k] * ids[j][k] ]
, then you'd have contiguous memory in src and dst so you (or the compiler) could use SIMD multiplies.