Using Halide ahead-of-time (AOT) with Metal on iOS


I'm trying to use Metal as the target for my ahead-of-time (AOT) Halide pipeline for use on iOS.

I've successfully created a Halide generator that produces a static library targeting Metal, and I can link and call it from my iOS app.

However, when I pass a Buffer<uint8_t> input_ to the function, the data in the Buffer always seems to be zero on the GPU side. Note that this only happens when running on the GPU on iOS.

Generator

#include "Halide.h"

using namespace Halide;

class MyHalideTest : public Halide::Generator<MyHalideTest> {
public:
    Input<Buffer<uint8_t>> input{"input", 3};
    Input<int32_t> width{"width"};
    Input<int32_t> height{"height"};
    Output<Buffer<uint8_t>> output{"output", 3};

    void generate() {
        output(x,y,c) = cast<uint8_t>(input(x,y,c)+25);
    }

    void schedule() {
        input
            .dim(0).set_stride(4)
            .dim(2).set_stride(1).set_bounds(0, 4);
        output
            .dim(0).set_stride(4)
            .dim(2).set_stride(1).set_bounds(0, 4);

        if (get_target().has_gpu_feature()) {
            output
                .reorder(c, x, y)
                .bound(c, 0, 4)
                .unroll(c);
            output.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
        }
        else {
            output
                .reorder(c, x, y)
                .unroll(c)
                .split(y, yo, yi, 16)
                .parallel(yo)
                .vectorize(x, 8);
        }
    }

private:
    Var x{"x"}, y{"y"}, c{"c"}, xi{"xi"}, xo{"xo"}, yi{"yi"}, yo{"yo"};

};

HALIDE_REGISTER_GENERATOR(MyHalideTest, "halide_test")

Command line used to run the generator

./MyHalideTest_generator -g halide_test \
-f halide_test_ARM64_metal \
-n halide_test_ARM64_metal \
-o "${DERIVED_FILE_DIR}" \
target=arm-64-ios-metal-debug-user_context

iOS code calling Halide function

Buffer<uint8_t> input_;
Buffer<uint8_t> output_;

// Other setup

- (void)initBuffersWithWidth:(int)w height:(int)h using_metal:(bool)using_metal
{
    // We really only need to pad this for the using_metal case,
    // but it doesn't really hurt to always do it.
    const int c = 4;
    const int pad_pixels = (64 / sizeof(int32_t));
    const int row_stride = (w + pad_pixels - 1) & ~(pad_pixels - 1);
    const halide_dimension_t pixelBufShape[] = {
        {0, w, c},
        {0, h, c * row_stride},
        {0, c, 1}
    };

    input_ = Buffer<uint8_t>(nullptr, 3, pixelBufShape);
    input_.allocate();
    auto buf = input_.raw_buffer()->host;
    memset(buf, 200, input_.size_in_bytes());

    // This allows us to make a Buffer with an arbitrary shape
    // and memory managed by Buffer itself
    output_ = Buffer<uint8_t>(nullptr, 3, pixelBufShape);
    output_.allocate();
}

...

/** Calling Halide function here **/
halide_test((__bridge void *)self, input_, width, height, output_);
output_.copy_to_host();

// Display output image...

So the code fills the input_ buffer with the value 200, which means every value in the returned output_ buffer should be 225. Instead, every value is 25, as if the input were all zeros.

I should note that this works correctly when running on my laptop's GPU and on the phone's CPU. The only difference is the Halide generator target.

Any ideas on why the Input<Buffer<uint8_t>> input seems to be set to all zeros when running the Halide function?

The debug output shows memory being allocated on the device side, but I don't see any call to halide_copy_to_device.


Solution

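  • If you write to a Buffer's host memory directly (here, the memset on input_), you need to mark it dirty afterwards: call input_.set_host_dirty() before calling halide_test. Otherwise Halide has no way of knowing the host copy changed, and it skips the copy to the device.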
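
To show why a missing dirty flag gives exactly 25, here is a toy model of the host/device bookkeeping. It is a sketch only: ToyBuffer, toy_add25 and first_output are hypothetical names made up for this illustration and are not part of Halide. It assumes that the upload happens only when the host-dirty flag is set and that freshly allocated device memory reads as zero, which matches what the question observed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a buffer with separate host and device storage.
// Hypothetical type for illustration only; this is NOT Halide's Buffer.
struct ToyBuffer {
    std::vector<uint8_t> host;    // CPU-side copy
    std::vector<uint8_t> device;  // simulated GPU-side copy, starts zeroed
    bool host_dirty = false;      // host has changes the device has not seen
    bool device_dirty = false;    // device has changes the host has not seen

    explicit ToyBuffer(std::size_t n) : host(n, 0), device(n, 0) {}

    void set_host_dirty() { host_dirty = true; }

    // Upload only if the host copy is flagged as newer.
    void copy_to_device() {
        if (host_dirty) {
            device = host;
            host_dirty = false;
        }
    }

    // Download only if the device copy is flagged as newer.
    void copy_to_host() {
        if (device_dirty) {
            host = device;
            device_dirty = false;
        }
    }
};

// Simulated GPU pipeline: output = input + 25, computed on the device copy.
void toy_add25(ToyBuffer &in, ToyBuffer &out) {
    assert(in.device.size() == out.device.size());
    in.copy_to_device();  // a no-op unless in.host_dirty is set
    for (std::size_t i = 0; i < in.device.size(); ++i) {
        out.device[i] = static_cast<uint8_t>(in.device[i] + 25);
    }
    out.device_dirty = true;  // the pipeline wrote the output on the device
}

// Fills the input with 200 (like the memset in the question), optionally
// marks it dirty, runs the toy pipeline, and returns the first output byte.
uint8_t first_output(bool mark_dirty) {
    ToyBuffer in(16), out(16);
    std::fill(in.host.begin(), in.host.end(), static_cast<uint8_t>(200));
    if (mark_dirty) {
        in.set_host_dirty();
    }
    toy_add25(in, out);
    out.copy_to_host();
    return out.host[0];
}
```

In this model, first_output(false) reproduces the symptom from the question (the pipeline only ever sees the zeroed device copy), while first_output(true) gives the expected 225. In the real app the corresponding change is the single input_.set_host_dirty() call right after the memset in initBuffersWithWidth.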