android renderscript android-renderscript

Recommended approach to compute over an arbitrarily sized 3D volume

To frame my question:

I'm writing a custom convolution (for a CNN) in which an arbitrarily sized HxWxD input volume is convolved with an FxFxD filter. D could be 3 or 4, but also much larger. I'm new to RenderScript and am currently investigating approaches, with the goal of possibly creating a framework that can be reused in the future, so I don't want to use the API in a way that may soon be deprecated. I'm targeting API level 23 right now but might need to move back to 18-19 at some point; this is up for discussion.

It appears that if I define a 3D Allocation and use float as the type of the kernel's input parameter, the kernel visits every element, including along the z-axis. Like this:

The kernel:

void __attribute__((kernel)) convolve(float in, uint32_t x, uint32_t y, uint32_t z){
    rsDebug("x y z: ", x, y, z);
}

Java:

Allocation in;
Type.Builder tb = new Type.Builder(mRS, Element.F32(mRS));
Type in_type = tb.setX(W).setY(H).setZ(D).create();
in = Allocation.createTyped(mRS, in_type);
//...
mKonvoScript.forEach_convolve(in);

With W=H=5 and D=3 there are 75 floats in the 3D volume. Running the program prints 75 outputs:

x y: {0.000000, 0.000000, 0.000000}
x y: {1.000000, 0.000000, 0.000000}
...
x y: {0.000000, 0.000000, 1.000000}
x y: {1.000000, 0.000000, 1.000000}
...

etc.

The pattern repeats 3x25 times.

On the other hand, the reference is unclear about the z-coordinate, and the answer to "renderscript: accessing 'z' coordinate" states that z-coordinate parameters are not supported.

Also I will need to bind the filter to an rs_allocation variable inside the kernel. Right now I have:

Kernel:

rs_allocation gFilter;
//...
float f = rsGetElementAt_float(gFilter, 1,2,3);

Java:

Allocation filter;
Type filter_type = tb.setX(F).setY(F).setZ(D).create();
filter = Allocation.createTyped(mRS, filter_type);

This seems to work well (no compile or runtime errors). BUT there is an SE entry somewhere from 2014 which states that from API level 20 onward we can only bind 1D allocations, which contradicts my results.

There is a lot of contradictory and outdated information out there, so I'm hoping someone on the inside could comment on this, and recommend an approach both from a sustainability and optimality perspective.

(1) Should I go ahead and use the passed xyz coordinates to compute the convolution with the bound 3D allocation? Or will this approach become deprecated at some point?

(2) There are other ways to do this. For example, I can reshape all allocations into 1D, pass them into the kernel, and use index arithmetic. This would also allow for placing certain values close to each other. Another approach might be to subdivide the input 3D volumes into blocks with depth 4 and use float4 as the in type. Assuming (1) is ok to use, from an optimization perspective, is there a disadvantage in using (1) as opposed to the other approaches?

(3) In general, is there a desirable memory layout formulation, for example to reformulate a problem into float3 or float4 depths, for optimality reasons, as opposed to a "straightforward" approach like (1)?

Solution

1) z is supported now as a coordinate that you can query, so my older answer is outdated. It is also why your example code above doesn't generate a compiler error (assuming you are targeting a relatively modern API level).

2) Stop using bind() even for 1D things (1D is the only case we still support, but even that isn't a great technique). You can declare an rs_allocation as a global variable in your .rs file and call set_*() from Java (for the gFilter global in your example, set_gFilter(filter)) to get equivalent access to these global Allocations. Then you use rsGetElementAt_*() and rsSetElementAt_*() of the appropriate types to read/write directly in the .rs file.

3) Doing memory layout optimizations like this can be beneficial for some devices and worse on others. If you can use the regular x/y/z APIs, those give the implementation the best opportunity to lay things out efficiently.
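
For comparison, the 1D index arithmetic from option (2) of the question can be sketched in plain Java. This is only a CPU-side reference under stated assumptions, not the RenderScript implementation: the class name ConvReference and both method names are hypothetical, the flat layout is assumed to be x-fastest (x, then y, then z), and the convolution is assumed to use stride 1 with no padding and no filter flip (cross-correlation, as is common in CNNs). It may be useful for checking a kernel's output on small volumes.

```java
public class ConvReference {
    // Flat index of element (x, y, z) in a volume of width w and height h.
    // Assumption: x varies fastest, then y, then z.
    static int flatIndex(int x, int y, int z, int w, int h) {
        return x + w * (y + h * z);
    }

    // Convolves a w x h x d input with an f x f x d filter.
    // Assumptions: stride 1, no padding, filter not flipped.
    // Returns a (w - f + 1) x (h - f + 1) output stored row by row.
    static float[] convolve(float[] in, int w, int h, int d, float[] filter, int f) {
        int outW = w - f + 1;
        int outH = h - f + 1;
        float[] out = new float[outW * outH];
        for (int oy = 0; oy < outH; oy++) {
            for (int ox = 0; ox < outW; ox++) {
                float sum = 0f;
                // Accumulate over the full depth and the f x f window.
                for (int z = 0; z < d; z++) {
                    for (int fy = 0; fy < f; fy++) {
                        for (int fx = 0; fx < f; fx++) {
                            sum += in[flatIndex(ox + fx, oy + fy, z, w, h)]
                                 * filter[flatIndex(fx, fy, z, f, f)];
                        }
                    }
                }
                out[oy * outW + ox] = sum;
            }
        }
        return out;
    }
}
```

With the question's W=H=5, D=3 and a hypothetical F=3, calling ConvReference.convolve(in, 5, 5, 3, filter, 3) yields a 3x3 output. The same loop body maps directly onto rsGetElementAt_float(gFilter, fx, fy, z) if you keep the x/y/z form recommended above.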