Tags: algorithm, gpu, opencl, boolean-logic

Optimal way of running intensive math calculations on a GPU


I am doing some academic/educational research (the topic is related to boolean logic and its extensions), and there is a mathematical algorithm that I would like to verify over a large set of numbers: specifically, every integer from 1 to 2^63. To verify it, I must run the algorithm for each of these numbers, sum the outputs and compare the total to a known, pre-defined value. The resulting sum will not be larger than the count of numbers itself.
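
To make the task concrete, here is a minimal serial sketch of that verification loop, where algorithm_output is a hypothetical placeholder for the actual algorithm (which is not part of the question); the GPU version parallelizes exactly this loop and its reduction:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical stand-in for the algorithm under test; the real function is
    // the O(n^2) bit-level computation in the OpenCL kernel and is not shown
    // here. This dummy exists only so the sketch compiles and runs.
    static std::uint64_t algorithm_output(std::uint64_t x) {
        return x & 1u;  // placeholder result in {0, 1}
    }

    // Serial reference for the verification described above: apply the
    // algorithm to every integer in [1, limit], sum the outputs, and compare
    // the total against the known pre-defined value. Feasible only for small
    // limits, but it is exactly this loop that the GPU parallelizes.
    static bool verify_range(std::uint64_t limit, std::uint64_t expected_sum) {
        std::uint64_t sum = 0;  // never exceeds the count of inputs, so 64 bits suffice
        for (std::uint64_t x = 1; x <= limit; ++x) {
            sum += algorithm_output(x);
        }
        return sum == expected_sum;
    }

    int main() {
        // Example: the dummy above sums to 2^19 over [1, 2^20] (half the inputs are odd).
        std::printf("%s\n", verify_range(1u << 20, 1u << 19) ? "match" : "mismatch");
    }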

I have a C++ host program and an OpenCL kernel that does the actual computation. It worked fine for values from 1 to 2^32 on an NVIDIA RTX 3060 GPU on an on-demand pod that I rented for a few hours - the computation took 10-20 seconds with all GPU cores in use.

However, when trying to use the same approach for 2^63 numbers, the amount of work grows by a factor of roughly 2^31 (about two billion), and it would not be possible for me to get a result in my mortal lifetime.

I am now looking for a way to parallelize or optimize this task, and I am a bit stuck on:

  1. What application-level tools should I use to achieve parallelization with a given OpenCL kernel?
  2. Are there online services that provide a UI or API for running such programs on a large cluster of computational devices at a reasonable price? I know some universities have such clusters, but it seems most of them are private and not available to the public.

I am looking for a way to speed up the computation of the result for 2^63 integers to, at most, a few weeks of constant computation.

The algorithm itself is a brute-force one with a worst-case time complexity of O(n^2), where n is the number of bits of the individual integer being tested, and this cannot be improved or changed: it is intentionally designed for brute-forcing over a large set of numbers, and for now my task is to prove its correctness. The kernel itself contains as many optimizations as possible; some of them also exit early when certain criteria are met, avoiding the computationally expensive operations where possible.
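
Purely as an illustration of that cost model (the real criteria are specific to the algorithm and are not given here), the per-integer work has roughly the following shape; cheap_filter and expensive_check are hypothetical placeholders:

    #include <bit>
    #include <cstdint>

    // Hypothetical placeholders; the actual tests belong to the algorithm
    // under verification and are not described in the question.
    bool cheap_filter(std::uint64_t x, int i)  { return ((x >> i) & 1u) == 0; }
    void expensive_check(std::uint64_t, int, int) { /* stand-in for costly work */ }

    // Schematic per-integer test: worst case O(n^2) in the bit count n, with
    // an early exit that skips the expensive inner loop whenever a cheap
    // criterion already decides the outcome.
    std::uint64_t test_one_integer(std::uint64_t x) {
        const int n = static_cast<int>(std::bit_width(x));  // significant bits of x (x >= 1)
        for (int i = 0; i < n; ++i) {
            if (cheap_filter(x, i)) {
                return 0;                      // early-exit path
            }
            for (int j = 0; j < n; ++j) {
                expensive_check(x, i, j);      // O(n^2) worst-case inner work
            }
        }
        return 1;
    }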

Thanks in advance for any help.


Solution

  • I suggest you make a rough guesstimate of the number of operations required to perform your test.

    n = max no. of bits of argument
    
    Assume Ops to test one integer = c*n^2 for some constant c
    
    TotalOps = c*sum_{i=1 to 63} (2^(i-1))(i^2) approx. = c*3.5e+22
    

    Estimating c from the OP's earlier test...

    c*sum_{i=1 to 32} (2^(i-1))(i^2) approx. = c*4.1e+12
    

    Suppose that run was 100 trillion operations (10 s @ 10 trillion ops/s); that would make c approx. = 25.

    Suppose you can perform 10 trillion operations per second on a single GPU. Then the full run would take c*3.5e+9 GPU-seconds, or about 2,800 GPU-years. Unless you have a sizeable budget for this, it's a non-starter. To perform 2,800 GPU-years of work in 1 month means using about 33,000 (RTX 3060) GPUs. That's a modern exascale supercomputer, for a month. A wealthy nation state might be able to do this, but the cost would probably be many millions, even if such machines were available to hire for general-purpose work, which they might not be.
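
    If you want to reproduce or tweak those numbers, the whole estimate fits in a short C++ program; the 10 trillion ops/s throughput and the ~10 s baseline for the 2^32 run are the same assumptions as above:

    #include <cmath>
    #include <cstdio>

    int main() {
        // Ops to test one i-bit integer ~= c*i^2, and there are 2^(i-1) i-bit integers.
        double sum63 = 0.0, sum32 = 0.0;
        for (int i = 1; i <= 63; ++i) {
            const double ops = std::ldexp(1.0, i - 1) * i * i;  // 2^(i-1) * i^2
            sum63 += ops;
            if (i <= 32) sum32 += ops;
        }
        const double gpu_rate = 1e13;                 // assumed 10 trillion ops/s per GPU
        const double c = 10.0 * gpu_rate / sum32;     // calibrated from the ~10 s run up to 2^32
        const double seconds = c * sum63 / gpu_rate;  // single-GPU time for the full 2^63 range
        std::printf("sum32 ~ %.1e, sum63 ~ %.1e, c ~ %.0f\n", sum32, sum63, c);
        std::printf("single-GPU time ~ %.1e s ~ %.0f GPU-years\n",
                    seconds, seconds / (365.25 * 24 * 3600));
    }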