c++ crtp expression-templates

Expression Templates + CRTP + AMP == kernel generation


I recently discovered the sheer awesomeness of expression templates and have reached a reasonably satisfactory level of understanding and skill in using them, but I now want to put the idiom to a new use. I will skip the lengthy story of how I arrived at this problem; the question is valuable on its own.

I am trying to create basic expression classes similar to those found on the wiki, but in a C++ AMP compatible form, meaning that all operations are performed in C++ AMP kernels. One can easily write wrapper classes for large vector operations in which every elementary operation is a separate kernel, but that is extremely inefficient. I am trying to create wrapper expression template classes that ultimately merge the operations into a single kernel.

Given the sample code on wiki, this would imply that inside the copy constructor of the Vec class, one would write

concurrency::parallel_for_each(vec.get_extent(), [&](index_type i) restrict(amp,cpu) {...});

instead of a regular for loop. The only problem is that inside restrict(amp) functions one can only use amp-compatible classes, whose restrictions are described in section 2 of the C++ AMP specification, most importantly in section 2.4. The biggest restriction is that a C++ AMP compatible class cannot have reference members, other than references to concurrency::array. This completely breaks the Expression Template idiom (I might be using the wrong word here), where operations are nested inside each other and each one holds references to its inner operands. Storing by value is, AFAIK, also not an option, because compilers only "see through" classes that have no members other than (const) references.
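To make the conflict concrete, here is a minimal host-only sketch of the reference-holding pattern I mean. It is plain standard C++ with no AMP in it; the names VecExpression, Vec and VecSum are only illustrative, and the for loop in the Vec constructor is the spot where the parallel_for_each call would have to go.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: every expression type E provides operator[] and size().
template <typename E>
struct VecExpression {
    const E& self() const { return static_cast<const E&>(*this); }
    double operator[](std::size_t i) const { return self()[i]; }
    std::size_t size() const { return self().size(); }
};

// Concrete vector that owns its storage.
struct Vec : VecExpression<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}

    // Evaluating an expression: a single loop walks the whole tree per element.
    template <typename E>
    Vec(const VecExpression<E>& e) : data(e.size()) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
    }
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Sum node: holds its operands by const reference.
template <typename L, typename R>
struct VecSum : VecExpression<VecSum<L, R>> {
    const L& lhs;  // reference members like these are what restrict(amp) rejects
    const R& rhs;
    VecSum(const L& l, const R& r) : lhs(l), rhs(r) {
        assert(l.size() == r.size());
    }
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename L, typename R>
VecSum<L, R> operator+(const VecExpression<L>& l, const VecExpression<R>& r) {
    return VecSum<L, R>(l.self(), r.self());
}
```

With this, Vec r = a + b + c; builds a VecSum<VecSum<Vec, Vec>, Vec> and evaluates it in one loop, which is exactly the fusion I want, but the lhs and rhs members are the references that an amp-compatible class may not have.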

Is there any way to make this work, or some alternate route that is fully host-side C++ and is only converted to a C++ AMP compatible construct at some later point? Ultimately I would like to write wrapper classes that people without any knowledge of GPGPU could use EFFICIENTLY, letting the compiler do all the hard work instead of me having to create a code-generation tool.

Thanks in advance.

P.S.: Naturally, index_type is concurrency::index<1> and container_type is either concurrency::array or concurrency::array_view, whichever helps solve the problem. array_view is logically cleaner: a Vec is created from an array that lives outside the class, and Vec only stores array_views into that array. However, array_views are not allowed as reference members in any form, and array should in principle allow more compiler optimizations than having every operation work on different array_views that might in fact point to the same physical array.


Solution

  • In case anyone comes across this highly outdated question: I found a solution to the initial problem.

    Instead of using Expression Templates with reference semantics, one can switch to value semantics and use array_view instances to refer to the data, rather than a const array&.
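
    A rough host-only sketch of that idea, under some loud assumptions: View is a hypothetical stand-in for concurrency::array_view<double, 1> (a cheap, copyable handle to storage owned elsewhere), the plain for loop stands in for the concurrency::parallel_for_each call, and Expr, Vec and Sum are illustrative names. It only shows the shape of the value-semantics design; whether the resulting types satisfy every restrict(amp) rule still has to be checked against the specification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for concurrency::array_view<double, 1>:
// copying a View copies the handle, not the underlying data.
struct View {
    double* data;
    std::size_t n;
    double& operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return n; }
};

// CRTP base used only to constrain operator+ and operator=.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Leaf node: stores a View by value.
struct Vec : Expr<Vec> {
    View v;
    explicit Vec(View view) : v(view) {}
    double operator[](std::size_t i) const { return v[i]; }
    std::size_t size() const { return v.size(); }

    // Assigning an expression evaluates the whole tree in one pass.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E expr = e.self();  // copy the tree by value, as a kernel capture would
        View out = v;
        assert(expr.size() == out.size());
        // Stand-in for the single parallel_for_each kernel.
        for (std::size_t i = 0; i < out.size(); ++i) out[i] = expr[i];
        return *this;
    }
};

// Sum node: operands are stored by value, so there are no reference members.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    L lhs;
    R rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

    Copying the expression tree is cheap here because every leaf is just a handle, so vr = va + vb + vc; still ends up as one fused loop over the output. In the C++ AMP version the idea is that the copied tree would be captured by value in the kernel lambda instead of being read through references.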