arrays multithreading thread-synchronization

Search through array with multiple threads while not doing any more extra work than necessary

Suppose you had a large unsorted array of length n that you wanted to search through to find a particular element (let the elements of this array be unique). Since you’d have to necessarily search through the entire array for the element in the worst case, the running time would be O(n). However, since most CPUs nowadays support multiple cores (with hyperthreading), you could have multiple threads search through the array to speed it up. So, with m cores, you’d have 2m (independent) threads available at your disposal. If you delegated only a portion of the array to each thread i.e. give each of the 2m threads n/2m elements of the array to process, it would be optimal. However, when one the 2m threads finds the element, the other threads will need to be stopped (to preserve system resources) since all the elements are unique and the other threads will never find the element.

So my question is this: How would you search through a large unsorted array with unique elements with 2m threads while minimizing the work done by your threads and the running time? What synchronized data structures would you need? How do you stop the other 2m - 1 threads when the element is found?

Solution

Probably the simplest way would be to have an atomic-boolean (std::atomic<bool> in C++-speak), and have the thread that finds the number set that boolean to true before exiting.

In addition to that, have each thread divide up its portion of the array into sub-portions, so that it can do a tight loop looking for the number in each sub-portion, then check the atomic-boolean, then repeat until it has run out of sub-portions to check. (The reason to use sub-portions rather than just checking the atomic-boolean after each iteration of one big for-loop is that even checking an atomic-boolean will be rather expensive due to cache-coherency issues, so it's better to amortize each atomic-boolean-check over a larger number of iterations and trade off a bit of extra/wasted work in exchange for better parallelism)

The ideal size to make each sub-portion would be something you'd need to derive empirically, by trying it with different sizes until you found the one that gave best performance.