Thread Pool in Java slower than serial version

I have a program that expands a given node to find the next possible nodes (children) and save/return them in childrenSet. I first implemented this serially like this:

    private Set<ReversiState> getChildrenSerial() {

        HashSet<ReversiState> childrenSet = new HashSet<>();

        // For each piece of the player, add the children.
        for(int row=0; row<BOARD_SIZE; row++){
            for(int col=0; col<BOARD_SIZE; col++){

                // This is where the heavy work happens
                addChildrenForPosition(childrenSet, row, col);
            }
        }

        return childrenSet;
    }

When I run my program using this, it finishes in around 9.7 seconds. The minimax algorithm that uses this method can search to an average depth of 7.0.

However, I wanted to be able to search deeper, as that is more desirable for the outcome I want. To do that, I tried to use Java's ThreadPoolExecutor, declared as a static final field in the same class, but got worse results.

    private static final int NB_THREADS = 8;
    private static final ThreadPoolExecutor executor = (ThreadPoolExecutor)
        Executors.newFixedThreadPool(NB_THREADS);

I then implemented the getChildrenParallel method. This is essentially the same as getChildrenSerial, but hands each addChildrenForPosition call to the thread pool as a task.

    private Set<ReversiState> getChildrenParallel() {

        HashSet<Future<Void>> threadResults = new HashSet<>();
        HashSet<ReversiState> childrenSet = new HashSet<>();

        // For each piece of the player, add the children.
        for(int row=0; row<BOARD_SIZE; row++){
            for(int col=0; col<BOARD_SIZE; col++){

                // Multi-threading takes final variables.
                final Integer rowFinal = row;
                final Integer colFinal = col;

                // Submit a task to the thread pool.
                Future<Void> future = executor.submit(

                    // This is the method where the heavy work happens
                    () -> addChildrenForPosition(childrenSet, rowFinal, colFinal),
                    null);
                threadResults.add(future);
            }
        }

        // Wait for all tasks to finish.
        for(Future<Void> future : threadResults){
            try{
                future.get();
            } catch(Exception e){
                e.printStackTrace();
            }
        }
        return childrenSet;
    }

I of course expected this to run faster than the serial version, as the threads are scheduled by the OS to give at least some resemblance of actual parallelism. However, this takes on average 11 seconds to run and the average depth drops to 6.3. It is slightly worse than the serial implementation, when I expected at least a doubling of performance.

Why is this happening? Is it because it takes some time to submit to the thread pool? Is the overhead too much for how small the task is? What can I do to correct it?

P.S.: I am running this on Windows 11.

Solution

Java Concurrency In Practice writes:

The actual cost of context switching varies across platforms, but a good rule of thumb is that a context switch costs the equivalent of 5,000 to 10,000 clock cycles, or several microseconds on most current processors.

They explain:

Context switches are not free; thread scheduling requires manipulating shared data structures in the OS and JVM. The OS and JVM use the same CPUs your program does; more CPU time spent in JVM and OS code means less is available for your program. But OS and JVM activity is not the only cost of context switches. When a new thread is switched in, the data it needs is unlikely to be in the local processor cache, so a context switch causes a flurry of cache misses, and thus threads run a little more slowly when they are first scheduled.

For your program, I'd expect cache misses to be quite severe indeed. Your code processes each node in the search tree as a separate task on the thread pool. So one thread will read the board state (from main memory), create a slightly modified copy, and rather than processing that copy while it is at hand, schedule its processing for a later time, quite possibly on a different thread. One would be hard pressed to find a less efficient way to access main memory ...

If you wish to make this concurrent, a ForkJoinPool might be a better fit. But frankly, you can achieve far greater performance improvements by improving your algorithm. For instance, you are currently copying the entire board every time you consider a move. Updating an existing board would be far faster ...
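To illustrate what I mean by updating an existing board, here is a minimal sketch on a toy problem. It is not Reversi: the board is a hypothetical int array, a "move" just fills one empty cell, and the class and method names are made up for the example. A real Reversi move flips several pieces, so the undo step would have to remember which ones were flipped. Both variants count the same move sequences, but the second one allocates nothing per move.

```java
public class InPlaceBoardDemo {

    // Hypothetical tiny board; 0 means empty, 1 means occupied.
    static final int CELLS = 6;

    // Copying variant: clones the whole board for every move considered.
    static long countByCopy(int[] board, int depth) {
        if (depth == 0) return 1;
        long total = 0;
        for (int i = 0; i < board.length; i++) {
            if (board[i] != 0) continue;
            int[] copy = board.clone();
            copy[i] = 1;
            total += countByCopy(copy, depth - 1);
        }
        return total;
    }

    // In-place variant: make the move, recurse, then undo it.
    static long countInPlace(int[] board, int depth) {
        if (depth == 0) return 1;
        long total = 0;
        for (int i = 0; i < board.length; i++) {
            if (board[i] != 0) continue;
            board[i] = 1;                           // make move
            total += countInPlace(board, depth - 1);
            board[i] = 0;                           // undo move
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countByCopy(new int[CELLS], 3));
        System.out.println(countInPlace(new int[CELLS], 3));
    }
}
```

Note that the in-place variant only works if a single thread owns the board while it searches, which is one more reason to keep parallel tasks coarse.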

I should also point out that your code is incorrectly synchronized. For instance, a HashSet is not safe for concurrent access.
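As a sketch of one way to address this, assuming the tasks only ever add to the shared set, you could replace the HashSet with the thread-safe set returned by ConcurrentHashMap.newKeySet(). The class name, the pool size, and the Integer elements below are placeholders for illustration; in your code the elements would be ReversiState objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SafeSetDemo {

    // Stand-in for the question's board size.
    static final int BOARD_SIZE = 8;

    static Set<Integer> collectConcurrently() {
        // Thread-safe Set backed by a ConcurrentHashMap.
        Set<Integer> children = ConcurrentHashMap.newKeySet();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int row = 0; row < BOARD_SIZE; row++) {
                for (int col = 0; col < BOARD_SIZE; col++) {
                    final int id = row * BOARD_SIZE + col;
                    // Concurrent add() calls are safe on this set.
                    futures.add(executor.submit(() -> { children.add(id); }));
                }
            }
            // Wait for all tasks to finish.
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return children;
    }

    public static void main(String[] args) {
        System.out.println(collectConcurrently().size());
    }
}
```

This only makes the code correct, not fast: the tasks are still far too small, and every add now contends on a shared structure.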

To conclude, multithreading can help with processing deep state trees, but

- should be done with rather coarse tasks to make efficient use of CPU caches,
- requires good understanding of when and how to protect shared data structures from concurrent modification,
- and tends to be far less impactful than the use of efficient algorithms.
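To make "coarse tasks" concrete, here is a minimal ForkJoinPool sketch. It is an illustration under assumptions, not a minimax implementation: it counts the leaves of a toy tree in which every node has BRANCHING children, and the names CoarseSearchDemo, CountTask, BRANCHING and FORK_LEVELS are made up for the example. The point is the shape: tasks are forked only in the top few levels of the tree, and each task then searches its whole subtree sequentially on one thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class CoarseSearchDemo {

    // Hypothetical toy tree: every node has BRANCHING children.
    static final int BRANCHING = 4;
    // Fork tasks only in the top FORK_LEVELS levels of the tree.
    static final int FORK_LEVELS = 2;

    // Plain recursion, run entirely on the calling thread.
    static long countLeavesSequential(int depth) {
        if (depth == 0) return 1;
        long total = 0;
        for (int i = 0; i < BRANCHING; i++) {
            total += countLeavesSequential(depth - 1);
        }
        return total;
    }

    static class CountTask extends RecursiveTask<Long> {
        private final int depth;  // remaining depth below this node
        private final int level;  // distance from the root

        CountTask(int depth, int level) {
            this.depth = depth;
            this.level = level;
        }

        @Override
        protected Long compute() {
            if (depth == 0) return 1L;
            // Deep enough: stop forking and finish this subtree sequentially.
            if (level >= FORK_LEVELS) return countLeavesSequential(depth);
            List<CountTask> subtasks = new ArrayList<>();
            for (int i = 0; i < BRANCHING; i++) {
                subtasks.add(new CountTask(depth - 1, level + 1));
            }
            invokeAll(subtasks);  // runs the subtasks and waits for them
            long total = 0;
            for (CountTask task : subtasks) {
                total += task.join();
            }
            return total;
        }
    }

    static long countLeavesParallel(int depth) {
        return ForkJoinPool.commonPool().invoke(new CountTask(depth, 0));
    }

    public static void main(String[] args) {
        System.out.println(countLeavesParallel(8));
    }
}
```

With these settings only 16 subtrees are handed to the pool, each large enough to amortize the scheduling cost. Whether this pays off for your search depends on how expensive a subtree is, so measure before and after.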