Halide with GPU (OpenGL) as Target - benchmarking and using HalideRuntimeOpenGL.h

I am new to Halide. I have been playing around with the tutorials to get a feel for the language. Now, I am writing a small demo app to run from the command line on OS X.

My goal is to perform a pixel-by-pixel operation on an image, schedule it on the GPU, and measure the performance. I have tried a couple of things, which I want to share here, and I have a few questions about the next steps.

First approach

I scheduled the algorithm on the GPU with OpenGL as the Target. Because I could not access the GPU memory to write the result to a file, I copied the output back to the CPU inside the Halide routine by creating a Func cpu_out, similar to the glsl sample app in the Halide repo.

pixel_operation_cpu_out.cpp

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

const int _number_of_channels = 4;

int main(int argc, char** argv)
{
    ImageParam input8(UInt(8), 3);

    input8
        .set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels (4)
        .set_stride(2, 1); // stride in dimension 2 (c) is one

    Var x("x"), y("y"), c("c");

    // algorithm
    Func input;
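    // clamp's upper bound is inclusive, so the last valid channel index is _number_of_channels - 1
    input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
                                 clamp(y, input8.top(), input8.bottom()),
                                 clamp(c, 0, _number_of_channels - 1))) / 255.0f;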

    Func pixel_operation;

    // calculate the corresponding value for input(x, y, c) after doing a
    // pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
    // This operation is not location dependent, e.g. brighten

    Func out;
    out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
    out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
    out.output_buffer().set_bounds(2, 0, _number_of_channels);

    // schedule

    out.compute_root();
    out.reorder(c, x, y)
        .bound(c, 0, _number_of_channels)
        .unroll(c);

    // Schedule for GLSL

    out.glsl(x, y, c);

    Target target = get_target_from_environment();
    target.set_feature(Target::OpenGL);

    // create a cpu_out Func to copy over the data in Func out from GPU to CPU
    std::vector<Argument> args = {input8};
    Func cpu_out;
    cpu_out(x, y, c) = out(x, y, c);
    cpu_out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    cpu_out.output_buffer().set_bounds(2, 0, _number_of_channels);
    cpu_out.compile_to_file("pixel_operation_cpu_out", args, target);

    return 0;
}

Since I compile this AOT, I make a function call in my main() for it. main() resides in another file.

main_file.cpp

Note: the Image class used here is the same as the one in this Halide sample app

int main()
{
    char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
    unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);

    Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    input.buf.host = &pixelsRGBA[0];
    unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
    output.buf.host = &outputPixelsRGBA[0];

    double best = benchmark(100, 10, [&]() {
         pixel_operation_cpu_out(&input.buf, &output.buf);
    });

    char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
    write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
}

This works just fine and gives me the output I expect. From what I understand, cpu_out makes the values in out available in CPU memory, which is why I am able to read them through output.buf.host in main_file.cpp.

Second approach

The second thing I tried was to skip the device-to-host copy in the Halide schedule (no Func cpu_out) and instead call the copy_to_host function from main_file.cpp.

pixel_operation_gpu_out.cpp

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

const int _number_of_channels = 4;

int main(int argc, char** argv)
{
    ImageParam input8(UInt(8), 3);

    input8
        .set_stride(0, _number_of_channels) // stride in dimension 0 (x) is _number_of_channels (4)
        .set_stride(2, 1); // stride in dimension 2 (c) is one

    Var x("x"), y("y"), c("c");

    // algorithm
    Func input;
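    // clamp's upper bound is inclusive, so the last valid channel index is _number_of_channels - 1
    input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
                                 clamp(y, input8.top(), input8.bottom()),
                                 clamp(c, 0, _number_of_channels - 1))) / 255.0f;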

    Func pixel_operation;

    // calculate the corresponding value for input(x, y, c) after doing a
    // pixel-wise operation on each pixel. This gives us pixel_operation(x, y, c).
    // This operation is not location dependent, e.g. brighten

    Func out;
    out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
    out.output_buffer()
        .set_stride(0, _number_of_channels)
        .set_stride(2, 1);
    input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
    out.output_buffer().set_bounds(2, 0, _number_of_channels);

    // schedule

    out.compute_root();
    out.reorder(c, x, y)
        .bound(c, 0, _number_of_channels)
        .unroll(c);

    // Schedule for GLSL

    out.glsl(x, y, c);

    Target target = get_target_from_environment();
    target.set_feature(Target::OpenGL);

    std::vector<Argument> args = {input8};
    out.compile_to_file("pixel_operation_gpu_out", args, target);

    return 0;
}

main_file.cpp

#include "pixel_operation_gpu_out.h"
#include "runtime/HalideRuntime.h"

int main()
{
    char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
    unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);

    Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
    input.buf.host = &pixelsRGBA[0];
    unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
    output.buf.host = &outputPixelsRGBA[0];

    double best = benchmark(100, 10, [&]() {
         pixel_operation_gpu_out(&input.buf, &output.buf);
    });

    int status = halide_copy_to_host(NULL, &output.buf);

    char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
    write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);

    return 0;
}

What I think is happening now is that pixel_operation_gpu_out keeps output.buf on the GPU, and the memory is only copied over to the CPU when I call copy_to_host. This program gives me the expected output as well.

Questions:

The second approach is much slower than the first, although the slow part is outside the benchmarked region. For example, with a 4k image the first approach benchmarks at 17 ms. For the same image, the second approach benchmarks at 22 µs, but the subsequent copy_to_host call takes 10 s. I'm not sure whether this behavior is expected, since both approaches are essentially doing the same thing.

The next thing I tried was to use HalideRuntimeOpenGL.h to link textures to the input and output buffers, so that I could draw directly to an OpenGL context from main_file.cpp instead of saving to a JPEG file. However, I could find no examples showing how to use the functions in HalideRuntimeOpenGL.h, and everything I tried on my own gave me runtime errors that I could not figure out how to solve. If anyone can point me to any resources, that would be great.

Also, any feedback on the code above is welcome. I know it works and does what I want, but it could be the completely wrong way of doing it and I wouldn't know any better.

Solution

Most likely, the 10 s copy-back time occurs because the GPU API queues all the kernel invocations asynchronously and only waits for them to finish when halide_copy_to_host is called. To measure the compute time without the copy-back time, call halide_device_sync inside the timed region, after the compute calls have been issued.
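
To make the effect concrete, here is a minimal sketch that does not use Halide at all. FakeDeviceQueue and elapsed_ms are hypothetical names invented for this illustration. The fake queue is a deliberate simplification (it defers the work until sync rather than running it concurrently, as a real GPU would), but it reproduces the timing pattern in the question: the enqueue call looks almost free, and the cost shows up in whichever call first waits for the device.

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-in for an asynchronous GPU queue (not a Halide type).
// enqueue() only records work and returns at once; sync() pays for it.
struct FakeDeviceQueue {
    std::chrono::milliseconds queued{0};

    // Returns immediately, like an asynchronous kernel launch.
    void enqueue(std::chrono::milliseconds work) { queued += work; }

    // Blocks until all recorded work is "done", like a device sync
    // or a copy back to the host.
    void sync() {
        std::this_thread::sleep_for(queued);
        queued = std::chrono::milliseconds{0};
    }
};

// Wall-clock milliseconds taken by fn().
template <typename Fn>
double elapsed_ms(Fn &&fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Timing only enqueue() here reports a tiny number, while timing enqueue() followed by sync() reports the real cost. In the question's program, the analogous change would be to put the halide_device_sync call inside the lambda passed to benchmark, right after the pipeline call.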

I cannot tell from the code how many times the kernel is run. My guess is 100, but the arguments to benchmark may set up some sort of parameterization in which it runs the kernel as many times as needed to reach significance. If so, that is a problem, because the queuing call is very fast while the compute itself is asynchronous. In that case, you can queue, say, ten calls, then call halide_device_sync, and vary the number ten to get a realistic picture of how long each call takes.
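
The "queue a batch, then sync" idea can be sketched as a small helper. This is only an illustration under stated assumptions: ms_per_call is a hypothetical name, and the two callables are supplied by the caller, so nothing here depends on Halide itself.

```cpp
#include <chrono>

// Hypothetical helper: average wall-clock milliseconds per call when
// `batch` calls are queued and then flushed with a single sync.
template <typename Enqueue, typename Sync>
double ms_per_call(int batch, Enqueue &&enqueue, Sync &&sync) {
    if (batch <= 0) return 0.0;

    // Drain anything queued earlier so it is not charged to this batch.
    sync();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < batch; ++i) {
        enqueue();  // fast: only queues the work
    }
    sync();         // slow: waits for the whole batch to finish
    auto t1 = std::chrono::steady_clock::now();

    double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return total_ms / batch;
}
```

With the code from the question, enqueue would presumably be a lambda calling pixel_operation_gpu_out(&input.buf, &output.buf) and sync a lambda calling halide_device_sync(NULL, &output.buf). Running the helper with batch sizes such as 1, 10 and 100 and checking that the per-call time settles is one way to confirm the number reflects compute time rather than queuing overhead.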