Why does my C++ function run 10x faster than the C# P/Invoke call?

I have a C++ function that uses AVX2 intrinsics to brighten an image. When I measure the performance directly in C++, it takes around 500 microseconds to process an image with a resolution of 3840 x 240. However, when I call the same function from C# using P/Invoke, it takes about 4 milliseconds, which is much slower.

For a 3840 x 2160 image, native C++ takes about 1.5 ms while P/Invoke takes about 4.5 ms.

For a 10000 x 4000 image, native C++ takes about 3 ms while P/Invoke takes about 8 ms.

Here is my setup:

C++: A native function that processes the image data using AVX2.

C#: Uses P/Invoke to call this function, passing the image data directly by reference.

C# main program:

static void Main(string[] args) 
{
    int width = 3840;
    int height = 240;
    byte[] image = new byte[width * height];
    Random random = new Random();
    
    // Fill the image array with random brightness values between 0 and 255
    for (int i = 0; i < image.Length; i++)
    {
        image[i] = (byte)random.Next(0, 256);
    }

    byte brightness = 30;
    
    // Measure C# processing time
    Stopwatch sw = Stopwatch.StartNew();
    ProcessorCSharp.BrightenImage(image, brightness);
    sw.Stop();

    Console.WriteLine("C# Time: {0} microseconds", sw.Elapsed.TotalMilliseconds * 1000);
}

C# Wrapper Class:

public class ProcessorCSharp
{
    [DllImport("ImageProcessingLib.dll", CallingConvention = CallingConvention.Cdecl)]
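    // size_t is pointer-sized on the native side, so it is marshalled as UIntPtr rather than int
    private static extern void brightenImageSIMD(IntPtr image, UIntPtr size, byte brightness);

    public static unsafe void BrightenImage(byte[] image, byte brightness)
    {
        UIntPtr size = (UIntPtr)(uint)image.Length;

        // Pin the array so the GC cannot move it during the native call
        fixed (byte* p = image)
        {
            brightenImageSIMD((IntPtr)p, size, brightness);
        }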
    }
}

C++ Function:

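#include <immintrin.h>  // AVX2 intrinsics
#include <algorithm>    // std::min
#include <chrono>       // timing
#include <cstddef>      // size_t
#include <cstdint>      // uint8_t
#include <iostream>
#include <random>
#include <utility>      // std::forward
#include <vector>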

extern "C" __declspec(dllexport) // in my main c++ code this line doesn't exist
void brightenImageSIMD(uint8_t* image, size_t size, uint8_t brightness) {
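    size_t i = 0;
    __m256i brightnessVector = _mm256_set1_epi8(static_cast<char>(brightness));
    __m256i maxVector = _mm256_set1_epi8(static_cast<char>(255));

    // Main loop: 32 pixels per iteration
    for (; i + 31 < size; i += 32) {
        __m256i pixels = _mm256_loadu_si256((__m256i*) &image[i]);
        // Unsigned saturating add: results already stop at 255
        __m256i brightened = _mm256_adds_epu8(pixels, brightnessVector);
        // No-op after the saturating add; could be removed
        __m256i clamped = _mm256_min_epu8(brightened, maxVector);
        _mm256_storeu_si256((__m256i*) &image[i], clamped);
    }

    // Scalar tail for the remaining size % 32 pixels
    for (; i < size; ++i) {
        image[i] = static_cast<uint8_t>(std::min(image[i] + brightness, 255));
    }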
}

// Helper function to measure execution time
template <typename Func, typename... Args>
long long measureExecutionTime(Func func, Args&&... args) {
    auto start = std::chrono::high_resolution_clock::now();
    func(std::forward<Args>(args)...);
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

int main() {
    const int width = 3840;
    const int height = 2160;
    const uint8_t brightnessIncrease = 30;

    std::vector<uint8_t> image(width * height);

    // Set up random number generation
    std::random_device rd;  // Seed for the random number engine
    std::mt19937 gen(rd()); // Standard mersenne_twister_engine
    std::uniform_int_distribution<> dis(0, 255); // Range from 0 to 255 for 8-bit brightness levels

    // Fill the image with random values
    for (auto& pixel : image) {
        pixel = static_cast<uint8_t>(dis(gen));
    }

    // Measure performance of SIMD method
    auto imageCopy3 = image; // Make another copy for fair comparison
    long long timeSIMD = measureExecutionTime(brightenImageSIMD, imageCopy3.data(), imageCopy3.size(), brightnessIncrease);

    std::cout << "SIMD (AVX2) method time: " << timeSIMD << " microseconds" << std::endl;
    

    return 0;
}

Problem:

I am wrapping C++ functions in C# because I need better performance than I can get from pure C# for real-time image processing.
For example, in C++ alone, brightenImageSIMD takes around 500 microseconds, but when I call it from C# it consistently takes about 4 milliseconds. I've tried using unsafe code with a fixed pointer to prevent array copying, but the performance difference remains.

Questions:

Why is there such a large performance gap between the native C++ execution and the C# P/Invoke call? What can I do to bring the C#-called version closer to the performance of the native C++? How do libraries like OpenCvSharp achieve excellent performance with P/Invoke? OpenCvSharp calls native OpenCV functions via P/Invoke and still maintains very high performance, so I'm curious whether it uses techniques I am missing that could apply here.

NOTE: I AM USING .NET FRAMEWORK 4.8

NOTE: I RAN BOTH OF THEM 100 TIMES AND TOOK THE AVERAGE OF THEIR PROCESSING TIMES. THE AVERAGES ARE NOW VERY SIMILAR TO EACH OTHER. HOWEVER, IN MY USE CASE I CAN'T RUN A FUNCTION 100 TIMES JUST TO MAKE IT RUN FASTER.
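
Averaging hides where the time goes, because a one-time cost on the first call gets divided across all 100 runs. Below is a minimal sketch of timing each call individually, so the first (cold) call can be compared against the later (warm) ones. It is written in C++ with a scalar stand-in for the SIMD routine so that it builds without AVX2 flags; `brightenScalar` and `timeEachCall` are hypothetical names, not part of the original code. The same pattern carries over to the C# side with Stopwatch.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar stand-in for the SIMD routine (assumption: same signature and
// saturating behavior as brightenImageSIMD).
void brightenScalar(std::uint8_t* image, std::size_t size, std::uint8_t brightness) {
    for (std::size_t i = 0; i < size; ++i) {
        const int sum = image[i] + brightness;
        image[i] = static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
    }
}

// Times each of `runs` calls on its own and returns the durations in
// microseconds. Element 0 is the first (cold) call.
std::vector<long long> timeEachCall(std::vector<std::uint8_t>& image,
                                    std::uint8_t brightness,
                                    std::size_t runs) {
    std::vector<long long> micros;
    micros.reserve(runs);
    for (std::size_t r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        brightenScalar(image.data(), image.size(), brightness);
        const auto end = std::chrono::steady_clock::now();
        micros.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    }
    return micros;
}
```

Calling `timeEachCall(image, 30, 100)` and printing element 0 next to the average of the remaining elements would show whether the gap is confined to the first call.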

Solution

The delay was indeed caused by the initial loading of the DLL when the function was first called. In .NET, the first P/Invoke call to an unmanaged function carries one-time overhead: the runtime loads the external DLL into memory, resolves the exported function, and sets up the marshalling for the call. This adds a noticeable delay to the first call only, which matters in performance-sensitive applications where even slight delays count.
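
The timings in the question fit this explanation. As a rough check, the sketch below subtracts the native time from the P/Invoke time for each of the three image sizes and also expresses the gap per pixel. The figures are the approximate ones quoted in the question; the struct and function names are made up for illustration.

```cpp
#include <array>

// One row per image size reported in the question (times in milliseconds).
struct Measurement {
    int width;
    int height;
    double nativeMs;   // measured directly in C++
    double pinvokeMs;  // measured through the C# P/Invoke call
};

// Approximate values copied from the question text.
constexpr std::array<Measurement, 3> kMeasurements{{
    {3840, 240, 0.5, 4.0},
    {3840, 2160, 1.5, 4.5},
    {10000, 4000, 3.0, 8.0},
}};

// Extra time on the managed call path, in milliseconds.
constexpr double extraMs(const Measurement& m) {
    return m.pinvokeMs - m.nativeMs;
}

// The same gap per pixel, in nanoseconds. A cost that scales with the data
// (such as copying the array) would keep this value roughly constant.
constexpr double extraNsPerPixel(const Measurement& m) {
    return extraMs(m) * 1e6 / (static_cast<double>(m.width) * m.height);
}
```

With these figures the gap is 3.5 ms, 3.0 ms and 5.0 ms, while the pixel count grows by a factor of about 43 from the smallest to the largest image. The per-pixel gap drops from about 3.8 ns to about 0.125 ns, which points to a mostly fixed one-time cost rather than one that scales with the amount of data.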

To avoid this first-call delay, I resolved the issue by preloading the DLL at the start of my application, well before the function was needed.

using System.Runtime.InteropServices;

public static class DllLoader
{
    [DllImport("ImageProcessingLib.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern void DummyFunction();

    public static void Load()
    {
        // Calling a no-op export forces the DLL to load now;
        // DummyFunction must be exported by ImageProcessingLib.dll
        DummyFunction();
    }
}

static void Main(string[] args)
{
    // Preload the DLL to avoid loading delay on the first function call
    DllLoader.Load();

    // Now we can call other functions without the initial load overhead
}

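This likely also helps explain why OpenCvSharp performs well from C#: once its native libraries are loaded, each P/Invoke call carries only a small per-call overhead.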
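
The C# loader above only works if the native DLL actually exports a function named DummyFunction; otherwise the call fails with an EntryPointNotFoundException. Here is a minimal sketch of the native side, assuming the export is added to the same source that builds ImageProcessingLib.dll. The `IMAGE_API` macro is a hypothetical helper, and its non-Windows branch is only there so the sketch also builds with GCC or Clang.

```cpp
// Export macro (hypothetical name). C linkage keeps the symbol name
// unmangled so that P/Invoke can find it as "DummyFunction".
#if defined(_WIN32)
#define IMAGE_API extern "C" __declspec(dllexport)
#else
#define IMAGE_API extern "C" __attribute__((visibility("default")))
#endif

// Intentionally empty: calling it from C# makes the runtime load the DLL
// and resolve this export, which is all the loader relies on.
IMAGE_API void DummyFunction() {}
```

With default compiler settings the function uses the C calling convention, which matches the `CallingConvention.Cdecl` setting in the DllImport attribute.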