Search code examples
c#java.neteclipsejvm-hotspot

Java is scaling much worse than C# over many cores?


I am testing spawning off many threads running the same function on a 32 core server for Java and C#. I run the application with 1000 iterations of the function, which is batched across either 1,2,4,8, 16 or 32 threads using a threadpool.

At 1, 2, 4, 8 and 16 concurrent threads Java is at least twice as fast as C#. However, as the number of threads increases, the gap closes and by 32 threads C# has nearly the same average run-time, but Java occasionally takes 2000ms (whereas both languages are usually running about 400ms). Java is starting to get worse with massive spikes in the time taken per thread iteration.

EDIT This is Windows Server 2008

EDIT2 I have changed the code below to show using the Executor Service threadpool. I have also installed Java 7.

I have set the following optimisations in the hotspot VM:

-XX:+UseConcMarkSweepGC -Xmx 6000

but it still hasnt made things any better. The only difference between the code is that im using the below threadpool and for the C# version we use:

http://www.codeproject.com/Articles/7933/Smart-Thread-Pool

Is there a way to make the Java more optimised? Perhaos you could explain why I am seeing this massive degradation in performance?

Is there a more efficient Java threadpool?

(Please note, I do not mean by changing the test function)

import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolDemo {

    static long FastestMemory = 2000000;
    static long SlowestMemory = 0;
    static long TotalTime;
    static int[] FileArray;
    static DataOutputStream outs;
    static FileOutputStream fout;
    static Byte myByte = 0;

  public static void main(String[] args) throws InterruptedException, FileNotFoundException {

        int Iterations = Integer.parseInt(args[0]);
        int ThreadSize = Integer.parseInt(args[1]);

        FileArray = new int[Iterations];
        fout = new FileOutputStream("server_testing.csv");

        // fixed pool, unlimited queue
        ExecutorService service = Executors.newFixedThreadPool(ThreadSize);
        ThreadPoolExecutor executor = (ThreadPoolExecutor) service;

        for(int i = 0; i<Iterations; i++) {
          Task t = new Task(i);
          executor.execute(t);
        }

        for(int j=0; j<FileArray.length; j++){
            new PrintStream(fout).println(FileArray[j] + ",");
        }
      }

  private static class Task implements Runnable {

    private int ID;

    public Task(int index) {
      this.ID = index;
    }

    public void run() {
        long Start = System.currentTimeMillis();

        int Size1 = 100000;
        int Size2 = 2 * Size1;
        int Size3 = Size1;

        byte[] list1 = new byte[Size1];
        byte[] list2 = new byte[Size2];
        byte[] list3 = new byte[Size3];

        for(int i=0; i<Size1; i++){
            list1[i] = myByte;
        }

        for (int i = 0; i < Size2; i=i+2)
        {
            list2[i] = myByte;
        }

        for (int i = 0; i < Size3; i++)
        {
            byte temp = list1[i];
            byte temp2 = list2[i];
            list3[i] = temp;
            list2[i] = temp;
            list1[i] = temp2;
        }

        long Finish = System.currentTimeMillis();
        long Duration = Finish - Start;
        TotalTime += Duration;
        FileArray[this.ID] = (int)Duration;
        System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");


        if(Duration < FastestMemory){
            FastestMemory = Duration;
        }
        if (Duration > SlowestMemory)
        {
            SlowestMemory = Duration;
        }
    }
  }
}

Solution

  • Summary

    Below are the original response, update 1, and update 2. Update 1 talks about dealing with the race conditions around the test statistic variables by using concurrency structures. Update 2 is a much simpler way of dealing with the race condition issue. Hopefully no more updates from me - sorry for the length of the response but multithreaded programming is complicated!

    Original Response

    The only difference between the code is that im using the below threadpool

    I would say that is an absolutely huge difference. It's difficult to compare the performance of the two languages when their thread pool implementations are completely different blocks of code, written in user space. The thread pool implementation could have enormous impact on performance.

    You should consider using Java's own built-in thread pools. See ThreadPoolExecutor and the entire java.util.concurrent package of which it is part. The Executors class has convenient static factory methods for pools and is a good higher level interface. All you need is JDK 1.5+, though the newer, the better. The fork/join solutions mentioned by other posters are also part of this package - as mentioned, they require 1.7+.

    Update 1 - Addressing race conditions by using concurrency structures

    You have race conditions around the setting of FastestMemory, SlowestMemory, and TotalTime. For the first two, you are doing the < and > testing and then the setting in more than one step. This is not atomic; there is certainly the chance that another thread will update these values in between the testing and the setting. The += setting of TotalTime is also non-atomic: a test and set in disguise.

    Here are some suggested fixes.

    TotalTime

    The goal here is a threadsafe, atomic += of TotalTime.

    // At the top of everything
    import java.util.concurrent.atomic.AtomicLong;  
    
    ...    
    
    // In PoolDemo
    static AtomicLong TotalTime = new AtomicLong();    
    
    ...    
    
    // In Task, where you currently do the TotalTime += piece
    TotalTime.addAndGet (Duration); 
    

    FastestMemory / SlowestMemory

    The goal here is testing and updating FastestMemory and SlowestMemory each in an atomic step, so no thread can slip in between the test and update steps to cause a race condition.

    Simplest approach:

    Protect the testing and setting of the variables using the class itself as a monitor. We need a monitor that contains the variables in order to guarantee synchronized visibility (thanks @A.H. for catching this.) We have to use the class itself because everything is static.

    // In Task
    synchronized (PoolDemo.class) {
        if (Duration < FastestMemory) {
            FastestMemory = Duration;
        }
    
        if (Duration > SlowestMemory) {
            SlowestMemory = Duration;
        }
    }
    

    Intermediate approach:

    You may not like taking the whole class for the monitor, or exposing the monitor by using the class, etc. You could do a separate monitor that does not itself contain FastestMemory and SlowestMemory, but you will then run into synchronization visibility issues. You get around this by using the volatile keyword.

    // In PoolDemo
    static Integer _monitor = new Integer(1);
    static volatile long FastestMemory = 2000000;
    static volatile long SlowestMemory = 0;
    
    ...
    
    // In Task
    synchronized (PoolDemo._monitor) {
        if (Duration < FastestMemory) {
            FastestMemory = Duration;
        }
    
        if (Duration > SlowestMemory) {
            SlowestMemory = Duration;
        }
    }
    

    Advanced approach:

    Here we use the java.util.concurrent.atomic classes instead of monitors. Under heavy contention, this should perform better than the synchronized approach. Try it and see.

    // At the top of everything
    import java.util.concurrent.atomic.AtomicLong;    
    
    . . . . 
    
    // In PoolDemo
    static AtomicLong FastestMemory = new AtomicLong(2000000);
    static AtomicLong SlowestMemory = new AtomicLong(0);
    
    . . . . .
    
    // In Task
    long temp = FastestMemory.get();       
    while (Duration < temp) {
        if (!FastestMemory.compareAndSet (temp, Duration)) {
            temp = FastestMemory.get();       
        }
    }
    
    temp = SlowestMemory.get();
    while (Duration > temp) {
        if (!SlowestMemory.compareAndSet (temp, Duration)) {
            temp = SlowestMemory.get();
        }
    }
    

    Let me know what happens after this. It may not fix your problem, but the race condition around the very variables that track your performance is too dangerous to ignore.

    I originally posted this update as a comment but moved it here so that I would have room to show code. This update has been through a few iterations - thanks to A.H. for catching a bug I had in an earlier version. Anything in this update supersedes anything in the comment.

    Last but not least, an excellent source covering all this material is Java Concurrency in Practice, the best book on Java concurrency, and one of the best Java books overall.

    Update 2 - Addressing race conditions in a much simpler way

    I recently noticed that your current code will never terminate unless you add executorService.shutdown(). That is, the non-daemon threads living in that pool must be terminated or else the main thread will never exit. This got me to thinking that since we have to wait for all threads to exit, why not compare their durations after they finished, and thus bypass the concurrent updating of FastestMemory, etc. altogether? This is simpler and could be faster; there's no more locking or CAS overhead, and you are already doing an iteration of FileArray at the end of things anyway.

    The other thing we can take advantage of is that your concurrent updating of FileArray is perfectly safe, since each thread is writing to a separate cell, and since there is no reading of FileArray during the writing of it.

    With that, you make the following changes:

    // In PoolDemo
    // This part is the same, just so you know where we are
    for(int i = 0; i<Iterations; i++) {
        Task t = new Task(i);
        executor.execute(t);
    }
    
    // CHANGES BEGIN HERE
    // Will block till all tasks finish. Required regardless.
    executor.shutdown();
    executor.awaitTermination(10, TimeUnit.SECONDS);
    
    for(int j=0; j<FileArray.length; j++){
        long duration = FileArray[j];
        TotalTime += duration;
    
        if (duration < FastestMemory) {
            FastestMemory = duration;
        }
    
        if (duration > SlowestMemory) {
            SlowestMemory = duration;
        }
    
        new PrintStream(fout).println(FileArray[j] + ",");
    }
    
    . . . 
    
    // In Task
    // Ending of Task.run() now looks like this
    long Finish = System.currentTimeMillis();
    long Duration = Finish - Start;
    FileArray[this.ID] = (int)Duration;
    System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
    

    Give this approach a shot as well.

    You should definitely be checking your C# code for similar race conditions.