Tags: opencl, gpu, md5, gpgpu, brute-force

GPU Brute-Force Implementation


I am asking for advice on the following problem:

For a research project I am writing a brute-force algorithm that runs on the GPU with (py)OpenCL.

(I know JTR is out there)

Right now I have a brute-force generator in Python which fills up the buffer with words (amount = 1024*64) for each round. I pass the buffer to the GPU kernel. For each value in the buffer the GPU calculates an MD5 hash and compares it with a given one. Great that it works!
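
Stripped down, the host side currently looks roughly like this (kernel name, fixed word length and the MD5 body are just placeholders here; the real kernel is longer):

    import numpy as np
    import pyopencl as cl

    BATCH = 1024 * 64          # words per round
    WORD_LEN = 8               # fixed candidate length (placeholder)

    KERNEL_SRC = """
    __kernel void hash_and_compare(__global const uchar *words,
                                   __global int *result)
    {
        int gid = (int) get_global_id(0);
        /* Placeholder: the real MD5 of words[gid * WORD_LEN ...] and the
           comparison against the target digest go here. */
        if (gid == 0)
            result[0] = -1;    /* -1 = "not found" */
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, KERNEL_SRC).build()
    mf = cl.mem_flags

    words = np.zeros((BATCH, WORD_LEN), dtype=np.uint8)   # filled by the CPU generator
    result = np.full(1, -1, dtype=np.int32)

    words_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=words)
    result_buf = cl.Buffer(ctx, mf.WRITE_ONLY, result.nbytes)

    # One work-item per candidate word: hash it and compare with the target.
    prg.hash_and_compare(queue, (BATCH,), None, words_buf, result_buf)
    cl.enqueue_copy(queue, result, result_buf)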

BUT:

I don't think this is really the full performance I can get from the GPU - or is it? Isn't there a bottleneck when I have to fill up the buffer on the CPU and pass it to the GPU 'just' for a hash calculation and comparison - or am I wrong, and this is already the fastest, or almost the fastest, performance I can get?

I have done a lot of research before deciding to ask this question here. I couldn't find a brute-force implementation that runs entirely in the GPU kernel so far - why?

Thx

EDIT 1:

Let me try to explain what I want to know in a different way. Let's say I have an average computer. Performing a brute-force algorithm on a GPU is faster than on a CPU (if you do it right). I have looked through some GPU brute-force tools and couldn't find one with the whole brute-force implementation in the GPU kernel.

Right now I am passing "word packages" to the GPU and letting it do the work (hash & compare) there - this looks like the common way. Isn't it faster to 'split' the brute-force algorithm so that each unit on the GPU generates its own "word packages" by itself?

All I am wondering is why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU's work on the GPU as well! Is it because it is not possible to split a brute-force algorithm on a GPU, or is the improvement not worth the effort of porting it to the GPU?
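
Roughly what I have in mind (shown in plain Python for clarity - the same few lines would sit inside the kernel, driven by get_global_id, so the host would only have to send a single start offset per batch; names and charset are made up):

    CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"

    def candidate_from_index(index, length):
        """Map a counter value to a fixed-length candidate word (base-N digits)."""
        word = []
        for _ in range(length):
            index, digit = divmod(index, len(CHARSET))
            word.append(CHARSET[digit])
        return "".join(word)

    # Work-item i of a batch starting at offset `start` would hash
    # candidate_from_index(start + i, length) without any CPU-filled buffer.
    print(candidate_from_index(1_000_000 + 5, 6))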


Solution

  • About the performance of the "brute-force" approach.

    All I am wondering is why the common way is to pass packages with values from the CPU to the GPU instead of doing the CPU's work on the GPU as well! Is it because it is not possible to split a brute-force algorithm on a GPU, or is the improvement not worth the effort of porting it to the GPU?

    I do not know the details of your algorithm, but, in general, there are some points to consider before creating a hybrid CPU-GPU algorithm. Just to name a few:

    • Different architectures (the best CPU algorithm is probably not the best GPU algorithm).
    • Extra synchronization points.
    • Different memory spaces (implies PCIe/network transfers).
    • More complex algorithms.
    • More complex fine tuning.
    • Vendor policies.

    Nevertheless, there are quite a few examples out there that combine the power of the GPU and the CPU at the same time. Typically, sequential or highly divergent parts of the algorithm run on the CPU while the homogeneous, compute-intensive part runs on the GPU. Other applications use the CPU to preprocess the input data into a format that is more amenable to GPU processing (for instance, changing the data layout). Finally, there are applications targeting pure performance that really do a significant amount of work on the CPU, like the MAGMA project.
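
    As a toy illustration of that preprocessing point (made-up data, nothing specific to your hashing problem): converting an array-of-structures into a structure-of-arrays on the CPU lets GPU work-items read contiguous, coalesced memory:

        import numpy as np

        # Array-of-structures: the fields of one record are adjacent in memory.
        aos = np.zeros(1024, dtype=[("x", np.float32), ("y", np.float32), ("z", np.float32)])

        # Structure-of-arrays: each field becomes its own contiguous array,
        # which is usually the friendlier layout for GPU memory accesses.
        soa_x = np.ascontiguousarray(aos["x"])
        soa_y = np.ascontiguousarray(aos["y"])
        soa_z = np.ascontiguousarray(aos["z"])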

    In summary, the answer is that it really depends on the details of your algorithm whether it is actually possible, or worth the effort, to design a hybrid algorithm that gets the most out of your CPU-GPU system as a whole.

  • About the performance of your current approach

    I think you should break your question down into two parts:

    • Is my GPU kernel efficient?
    • How much of the time am I actually doing work on the GPU?

    Regarding the first one, you didn't provide any information about your GPU kernel, so we cannot really help you much with it, but general optimization approaches apply:

    1. Is your computation memory-bound or compute-bound?
    2. How far are you from your GPU peak memory bandwidth?

    You need to start from these questions in order to know what kind of optimization/algorithm you should apply. Take a look at the roofline performance model.
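
    As a back-of-the-envelope version of the roofline idea (the numbers below are invented; plug in your own device's specs):

        # Attainable throughput = min(peak compute, arithmetic intensity * peak bandwidth)
        PEAK_COMPUTE = 4000.0      # GFLOP/s  (hypothetical GPU)
        PEAK_BANDWIDTH = 200.0     # GB/s     (hypothetical GPU)

        def roofline(flops_per_byte):
            return min(PEAK_COMPUTE, flops_per_byte * PEAK_BANDWIDTH)

        # A kernel doing ~10 FLOPs per byte moved would be bandwidth-bound here:
        print(roofline(10.0))      # 2000.0 GFLOP/s, well below the 4000 GFLOP/s peak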

    As for the second question, even though you don't go into detail, it seems like your application spends a lot of its time on small memory transfers (take a look at this article about how to optimize memory transfers). The overhead of starting a PCIe transfer just to send a few words would kill any performance benefit you get from using a GPU device. Thus, sending a bunch of small buffers instead of packing a large number of them into big chunks of memory is not, in general, the way to go.
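
    For instance, a crude way to see the per-transfer overhead (timings depend entirely on your hardware, and the destination offset is ignored here because only the transfer cost matters):

        import time
        import numpy as np
        import pyopencl as cl

        ctx = cl.create_some_context()
        queue = cl.CommandQueue(ctx)

        data = np.random.randint(0, 256, size=16 * 1024 * 1024, dtype=np.uint8)  # 16 MiB
        dev_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=data.nbytes)

        t0 = time.perf_counter()
        for i in range(16 * 1024):                       # 16384 copies of 1 KiB each
            cl.enqueue_copy(queue, dev_buf, data[i * 1024:(i + 1) * 1024])
        queue.finish()
        t_small = time.perf_counter() - t0

        t0 = time.perf_counter()
        cl.enqueue_copy(queue, dev_buf, data)            # one 16 MiB transfer
        queue.finish()
        t_large = time.perf_counter() - t0

        print(f"16384 x 1 KiB: {t_small:.3f} s   vs   1 x 16 MiB: {t_large:.3f} s")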

    If you're looking for performance, you may want to overlap the computation and the memory transfer. Read this article for more information.
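
    A minimal double-buffering sketch with two command queues (placeholder kernel and made-up sizes; whether real overlap happens also depends on your driver and on using pinned host memory): upload batch N+1 while the kernel works on batch N.

        import numpy as np
        import pyopencl as cl

        KERNEL_SRC = """
        __kernel void process(__global const uchar *words)
        {
            /* Placeholder for the real hash-and-compare work. */
        }
        """

        ctx = cl.create_some_context()
        compute_q = cl.CommandQueue(ctx)
        transfer_q = cl.CommandQueue(ctx)
        prg = cl.Program(ctx, KERNEL_SRC).build()
        mf = cl.mem_flags

        N_WORDS = 1024 * 64
        BATCH_BYTES = N_WORDS * 8
        bufs = [cl.Buffer(ctx, mf.READ_ONLY, size=BATCH_BYTES) for _ in range(2)]

        def next_batch():
            # Stand-in for the CPU-side word generator.
            return np.random.randint(0, 256, size=BATCH_BYTES, dtype=np.uint8)

        upload = cl.enqueue_copy(transfer_q, bufs[0], next_batch(), is_blocking=False)
        done = None                      # kernel event of the buffer we will overwrite next
        for i in range(1, 10):
            cur, nxt = bufs[(i - 1) % 2], bufs[i % 2]
            # The kernel waits only for the upload of its own buffer ...
            run = prg.process(compute_q, (N_WORDS,), None, cur, wait_for=[upload])
            # ... while the next batch is uploaded into the other buffer, which
            # must not be overwritten before the kernel that read it has finished.
            deps = [done] if done is not None else None
            upload = cl.enqueue_copy(transfer_q, nxt, next_batch(),
                                     is_blocking=False, wait_for=deps)
            done = run
        compute_q.finish()
        transfer_q.finish()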

    As a general recommendation, before implementing any optimization, take some time to profile your application. It would save you a lot of time.