c#hash parallel-processing cryptography md5

Is there a way to speed up my code to find an md5 hash that starts with all zeros?

It is just for a hobby, no official or serious stuff involved. I do have a question though; when running this code, it computes around 400 million md5 hashes in the first second, then it drops down to ~10 million hashes per second, is there a way to speed this up?

Here is the code:

using System;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;
using System.Runtime.InteropServices;
using Org.BouncyCastle.Crypto.Digests;
using Org.BouncyCastle.Crypto.Paddings;

public class SimdMD5HashFinder
{
    private static readonly int numberOfThreads = 16; // Number of threads (cores)
    private static readonly CancellationTokenSource cts = new CancellationTokenSource(); // Cancellation token for cancellation management
    private const int batchSize = 25000000; // Number of random byte arrays to process in each batch
    private const int byteArraySize = 16; // Size of each random byte array (16 bytes for MD5 input)
    public static ulong n = 0;

    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr SetThreadAffinityMask(IntPtr hThread, IntPtr dwThreadAffinityMask);

    public static void Main(string[] args)
    {
        Console.WriteLine("Starting multiple threads to find MD5 hashes with the first three bytes equal to zero...");

        Task[] tasks = new Task[numberOfThreads];

        for (int i = 0; i < numberOfThreads; i++)
        {
            int threadId = i; // Capture the current thread index
            tasks[i] = Task.Run(() => FindHash(threadId));
        }

        // Wait for tasks to complete
        Task.WaitAll(tasks);

        Console.WriteLine("All threads have completed or a match was found.");
        Console.ReadKey();
    }

    private static void FindHash(int threadId)
    {
        // Pin the thread to a specific CPU core
        PinThreadToCore(threadId);

        // Create MD5 digest object from BouncyCastle
        MD5Digest md5 = new MD5Digest();
        byte[] randomBytes = new byte[batchSize * byteArraySize]; // Single large array to hold all random bytes for the batch
        byte[] hashBytes = new byte[md5.GetDigestSize()]; // MD5 hash is always 16 bytes
        RandomNumberGenerator rng = RandomNumberGenerator.Create();

        while (!cts.Token.IsCancellationRequested) // Loop until cancellation is requested
        {
            n += (ulong)batchSize; // Increment n by batchSize

            if (n % 1000000 == 0)
            {
                Console.WriteLine(n);
            }

            // Fill the large buffer with random bytes
            rng.GetBytes(randomBytes);
            // Process the buffer in chunks, each chunk is a 16-byte random array
            for (int i = 0; i < randomBytes.Length; i += byteArraySize)
            {
                // Calculate MD5 hash of the 16-byte chunk
                md5.BlockUpdate(randomBytes, i, byteArraySize);
                md5.DoFinal(hashBytes, 0);

                // Check if the first three bytes of the hash are all zero
                if (hashBytes[0] == 0 && hashBytes[1] == 0 && hashBytes[2] == 0 && hashBytes[3] == 0)
                {
                    Console.WriteLine($"\nThread {threadId} found a match!");
                    Console.WriteLine($"Random Bytes:    {BitConverter.ToString(randomBytes, i, byteArraySize).Replace("-", "").ToLower()}"); // Show the chunk of random bytes as hex
                    Console.WriteLine($"MD5  Hash Bytes: {BitConverter.ToString(hashBytes).Replace("-", "").ToLower()}");
                    // Signal cancellation to stop all other threads
                    cts.Cancel();
                    return; // Exit the method once a match is found
                }

                // Reset the MD5 digest for the next round
                md5.Reset();
            }
        }
    }

    // Method to pin a thread to a specific core
    private static void PinThreadToCore(int coreId)
    {
        IntPtr mask = new IntPtr(1 << coreId); // Create affinity mask for the given core
        IntPtr thread = GetCurrentThread(); // Get current thread handle

        // Set the affinity mask to bind the thread to the specific core
        SetThreadAffinityMask(thread, mask);
    }
}

I tried almost everything, but this is the fastes way of doing things as far as I know. I create a buffer of size N, since otherwise it slows down because of the overhead of creating a new MD5 object every time. I tried looking into GPU computing with ILGPU but that i found way too compicated.

Solution

There are a couple possible improvements.

Side note: I'm not terribly familiar with C#, so I might make mistakes.

Don't use random bytes for the hashes. Any decent cryptographic hash will transform the input to something that looks completely random, so using random values adds a bunch more work, while a counter is just as good.
On the topic of the counter, you have two main options: a BigInteger or an unsigned long counter.

The BigInteger is simpler, but is much slower. On the other hand, you could use an unsigned long counter (one per task) and a unique-per-task one-byte ID. That would give you ~18 quintillion iterations per thread before running out.
You could add a few micro-optimizations, like:
- You should remove the logging statement every 1 000 000 iterations. It's a useful feature for debugging, but it adds a division and branch to each hash.
- Whenever implementing a new part, avoid initializations and branches. Initializations are much more expensive, but branches can still be bad. These issues can easily happen when adding new parts.

On a similar note, if you really want to make it fast...

You could reimplement it in C or assembly. C# has a little bit of overhead, and assembly/C allows you to use AVX and even SHA-NI (if you are able to use SHA-1 instead of MD5).
You could also use a graphics card to speed it up (using ROCm/CUDA). On my Ryzen 5 5600, I get ~6.6 million hashes per second (according to OpenSSL), while an RTX 2070 can manage 20 billion (according to this comment).