I have a C++ function that uses AVX2 intrinsics to brighten an image. When I measure the performance directly in C++, it takes around 500 microseconds to process an image with a resolution of 3840 x 240. However, when I call the same function from C# using P/Invoke, it takes about 4 milliseconds, which is much slower.
For a 3840 x 2160 image, native C++ takes about 1.5 ms while P/Invoke takes about 4.5 ms.
For a 10000 x 4000 image, native C++ takes about 3 ms while P/Invoke takes about 8 ms.
Here is my setup:
C++: A native function that processes the image data using AVX2. C#: Uses P/Invoke to call this function, passing the image data directly by reference.
static void Main(string[] args)
{
int width = 3840;
int height = 240;
byte[] image = new byte[width * height];
Random random = new Random();
// Fill the image array with random brightness values between 0 and 255
for (int i = 0; i < image.Length; i++)
{
image[i] = (byte)random.Next(0, 256);
}
byte brightness = 30;
// Measure C# processing time
Stopwatch sw = Stopwatch.StartNew();
ProcessorCSharp.BrightenImage(image, brightness);
sw.Stop();
Console.WriteLine("C# Time: {0} microseconds", sw.Elapsed.TotalMilliseconds * 1000);
}
public class ProcessorCSharp
{
[DllImport("ImageProcessingLib.dll", CallingConvention = CallingConvention.Cdecl)]
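// size_t on the native side is pointer-sized, so marshal it as UIntPtr rather than int
private static extern void brightenImageSIMD(IntPtr image, UIntPtr size, byte brightness);
public static unsafe void BrightenImage(byte[] image, byte brightness)
{
// Pin the array so the GC cannot move it during the native call; no copy is made
fixed (byte* p = image)
{
brightenImageSIMD((IntPtr)p, (UIntPtr)(uint)image.Length, brightness);
}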
}
}
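#include <immintrin.h> // AVX2 intrinsics
#include <algorithm> // For std::min
#include <chrono> // For timing
#include <cstddef> // For size_t
#include <cstdint> // For uint8_t
#include <iostream> // For std::cout
#include <random> // For std::random_device, std::mt19937
#include <utility> // For std::forward
#include <vector>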
extern "C" __declspec(dllexport) // in my main c++ code this line doesn't exist
void brightenImageSIMD(uint8_t* image, size_t size, uint8_t brightness) {
size_t i = 0;
__m256i brightnessVector = _mm256_set1_epi8(brightness);
__m256i maxVector = _mm256_set1_epi8(255);
for (; i + 31 < size; i += 32) {
__m256i pixels = _mm256_loadu_si256((__m256i*) &image[i]);
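// Saturating unsigned add: each byte lane already stops at 255 instead of wrapping
__m256i brightened = _mm256_adds_epu8(pixels, brightnessVector);
// Redundant after the saturating add above (kept to match the original code)
__m256i clamped = _mm256_min_epu8(brightened, maxVector);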
_mm256_storeu_si256((__m256i*) &image[i], clamped);
}
for (; i < size; ++i) {
image[i] = std::min(image[i] + brightness, 255);
}
}
// Helper function to measure execution time
template <typename Func, typename... Args>
long long measureExecutionTime(Func func, Args&&... args) {
auto start = std::chrono::high_resolution_clock::now();
func(std::forward<Args>(args)...);
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
int main() {
const int width = 3840;
const int height = 2160;
const uint8_t brightnessIncrease = 30;
std::vector<uint8_t> image(width * height);
// Set up random number generation
std::random_device rd; // Seed for the random number engine
std::mt19937 gen(rd()); // Standard mersenne_twister_engine
std::uniform_int_distribution<> dis(0, 255); // Range from 0 to 255 for 8-bit brightness levels
// Fill the image with random values
for (auto& pixel : image) {
pixel = static_cast<uint8_t>(dis(gen));
}
// Measure performance of SIMD method
auto imageCopy3 = image; // Make another copy for fair comparison
long long timeSIMD = measureExecutionTime(brightenImageSIMD, imageCopy3.data(), imageCopy3.size(), brightnessIncrease);
std::cout << "SIMD (AVX2) method time: " << timeSIMD << " microseconds" << std::endl;
return 0;
}
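As a sanity check for the kernel above, here is a minimal scalar sketch of the same saturating brighten. The names `saturatingAdd` and `brightenImageScalar` are hypothetical helpers, not part of the original code; they only model the per-byte behaviour of `_mm256_adds_epu8`, which is why the extra `_mm256_min_epu8` clamp is not strictly needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the original post): add brightness to one
// pixel and clamp at 255, mirroring what _mm256_adds_epu8 does per byte lane.
inline std::uint8_t saturatingAdd(std::uint8_t pixel, std::uint8_t brightness) {
    const unsigned sum = static_cast<unsigned>(pixel) + brightness;
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Hypothetical scalar reference with the same signature as brightenImageSIMD.
// Useful for comparing outputs, not for speed.
void brightenImageScalar(std::uint8_t* image, std::size_t size, std::uint8_t brightness) {
    assert(image != nullptr || size == 0);
    for (std::size_t i = 0; i < size; ++i) {
        image[i] = saturatingAdd(image[i], brightness);
    }
}
```

Running both versions on copies of the same buffer and comparing the results is a cheap way to confirm that the SIMD loop and its scalar tail agree.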
The main reason for me trying C++ wrapped functions in C# is I need better performance than I am achieving in C# for realtime image processing applications.
For example in C++ alone, brightenImageSIMD takes around 500 microseconds. But when I call it from C#, it consistently takes about 4 milliseconds. I've tried using unsafe code with a fixed pointer to prevent array copying, but the performance difference remains.
Why is there such a large performance gap between the native C++ execution and the C# P/Invoke call? What can I do to bring the C#-called version closer to the performance of the native C++? How do libraries like OpenCvSharp achieve excellent performance with P/Invoke? OpenCvSharp calls native OpenCV functions via P/Invoke and still maintains very high performance, so I'm curious whether there are techniques from that library that I am missing and could apply here.
Note: I am using .NET Framework 4.8.
Note: I ran both versions 100 times and took the average of their processing times. The averages are now very similar to each other. However, in my use case I can't run a function 100 times just to make it run faster.
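The observation in the note above (similar averages, but a slow single call) suggests timing the first call separately from the rest. Below is a minimal sketch of that measurement pattern; `TimingResult` and `timeFirstVsSteady` are hypothetical names, and it uses `std::chrono::steady_clock` rather than the `high_resolution_clock` from the code above. The same idea carries over to `Stopwatch` on the C# side.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical result type: one-time cost versus steady-state cost.
struct TimingResult {
    double firstCallMicros;    // duration of the very first call
    double steadyStateMicros;  // mean duration of the following calls
};

// Hypothetical helper: call func once, then `repeats` more times, and
// report the first call separately from the average of the rest.
template <typename Func>
TimingResult timeFirstVsSteady(Func&& func, std::size_t repeats) {
    assert(repeats > 0);
    using Clock = std::chrono::steady_clock;
    using Micros = std::chrono::duration<double, std::micro>;

    const auto t0 = Clock::now();
    func();  // first call: includes any one-time setup cost
    const auto t1 = Clock::now();
    for (std::size_t i = 0; i < repeats; ++i) {
        func();  // later calls: steady-state cost only
    }
    const auto t2 = Clock::now();

    return {Micros(t1 - t0).count(), Micros(t2 - t1).count() / repeats};
}
```

If `firstCallMicros` is far larger than `steadyStateMicros`, the gap is a one-time startup cost rather than a per-call cost.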
The delay was indeed caused by the initial loading of the DLL when the function was first called. In .NET, the first P/Invoke call to an unmanaged function carries one-time overhead: the runtime has to load the external DLL into memory, resolve the entry point, and prepare the marshalling stub for that function. This adds a noticeable delay to the first call only, which matters in performance-sensitive applications where a single slow frame is a problem.
To avoid this first-call delay, I resolved the issue by preloading the DLL at the start of my application, well before the function was needed.
using System.Runtime.InteropServices;
public static class DllLoader
{
[DllImport("ImageProcessingLib.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void DummyFunction();
public static void Load()
{
// Calling a no-op export forces the runtime to load the DLL;
// DummyFunction must actually be exported by ImageProcessingLib.dll
DummyFunction();
}
}
static void Main(string[] args)
{
// Preload the DLL to avoid loading delay on the first function call
DllLoader.Load();
// Now we can call other functions without the initial load overhead
}
This is one of the reasons OpenCvSharp runs fast in C#!
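The `DllLoader` class above assumes the native library exports a function named `DummyFunction`, which the code shown so far does not define. A minimal sketch of what that export might look like on the C++ side follows; the `IMAGE_API` macro is a hypothetical convenience, and the non-Windows branch is only there so the snippet compiles with other toolchains.

```cpp
// Hypothetical export macro: C linkage keeps the name unmangled so that
// P/Invoke can find "DummyFunction" by name.
#if defined(_WIN32)
#define IMAGE_API extern "C" __declspec(dllexport)
#else
#define IMAGE_API extern "C" __attribute__((visibility("default")))
#endif

// Assumed no-op export: it does nothing, its only purpose is to give the
// managed side something cheap to call so the DLL gets loaded early.
IMAGE_API void DummyFunction() {}
```

Without such an export in `ImageProcessingLib.dll`, the call in `DllLoader.Load()` would fail at runtime instead of warming up the library.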