GridGain application that is slower than a multithreaded application on one machine

I have implemented my first GridGain application and am not getting the performance improvements I expected. Sadly it is slower. I would like some help in improving my implementation so it can be faster.

The gist of my application is I am doing a brute force optimization with millions of possible parameters that take a fraction of a second for each function evaluation. I have implemented this by dividing up the millions of iterations into a few groups, and each group is executed as one job.

The relevant piece of code is below. the function maxAppliedRange calls function foo for every value in the range x, and returns the maximum, and the result becomes the maximum of all the maximums found by each job.

  scalar {
    result = grid !*~
      (for (x <- (1 to threads).map(i => ((i - 1) * iterations / threads, i * iterations / threads)))
        yield () => maxAppliedRange(x, foo), (s: Seq[(Double, Long)]) => s.max)
  }

My code can chose between a multi-threaded execution on one machine or use several GridGain nodes using the code above. When I run the gridgain version it starts out like it is going to be faster, but then a few things always happen:

One of the nodes (on a different machine) misses a heartbeat, causing the node on my main computer to give up on that node and to start executing the job a second time.
The node that missed a heartbeat continues doing the same job. Now I have two nodes doing the same thing.
Eventually, all jobs are being executed on my main machine, but since some of the jobs started later, it takes way longer for everything to finish.
Sometimes an exception gets thrown by GridGain because a node timed out and the whole task gets failed.
I get annoyed.

I tried setting it up to have many jobs so if one failed then it wouldn't be as big of a deal, but when I do this I end up with many jobs being executed on each node. That puts a much bigger burden on each machine making it more likely for a node to miss a heartbeat, causing everything to go downhill faster. If I have one job per CPU then if one job fails, a different node has to start over from the beginning. Either way I can't win.

What I think would work best is if I could do two things:

Increase the timeout for heartbeats
Throttle each node so that it only does one job at a time.

If I could do this, I could divide up my task into many jobs. Each node would do one job at a time and no machine would become overburdened to cause it to miss a heartbeat. If a job failed then little work would be lost and recovery would be quick.

Can anyone tell me how to do this? What should I be doing here?

Solution

Now I have it working correctly. In my situation for my application I am getting about a 50% speed-up over a multithreaded application on one machine, but that is not the best that I can do. More work is to be done.

To use gridgain, it seems that the configuration file is critical to getting everything working. This is where node behavior is set and must match your application's needs.

One thing I needed in my xml configuration file is this:

    <property name="discoverySpi">
        <bean class="org.gridgain.grid.spi.discovery.multicast.GridMulticastDiscoverySpi">
            <property name="maxMissedHeartbeats" value="20"/>
            <property name="leaveAttempts" value="10"/>
        </bean>
    </property>

This sets the maximum heartbeats that can be missed before a node is considered missing. I set this to a high value because I kept having a problem of nodes leaving and coming back a few seconds later. Alternatively, instead of using multicasting I could have fixed the IPs of the machines with running nodes using other properties in the in the config file. I didn't do this but if you are using the same machines over and over it probably would be more reliable.

The other thing I did is:

    <property name="collisionSpi">
        <bean class="org.gridgain.grid.spi.collision.jobstealing.GridJobStealingCollisionSpi">
            <property name="activeJobsThreshold" value="2"/>
            <property name="waitJobsThreshold" value="4"/>
            <property name="maximumStealingAttempts" value="10"/>
            <property name="stealingEnabled" value="true"/>
            <property name="messageExpireTime" value="1000"/>
        </bean>
    </property>

    <property name="failoverSpi">
        <bean class="org.gridgain.grid.spi.failover.jobstealing.GridJobStealingFailoverSpi">
            <property name="maximumFailoverAttempts" value="10"/>
        </bean>
    </property>

For the first one, the activeJobsThreshold value tells the node how many jobs it can run at the same time. This is a better way of doing throttling than changing the number of threads in the executor service. Also, it does some load balancing and idle nodes can 'steal' work from other nodes to get everything done faster.

There are better ways to do this also. Gridgain can do size the jobs based on the measured performance of each node, apparently, which would improve overall performance, especially if you have fast and slow computers in the grid.

For the future I am going to study the configuration file and compare that to the javadocs to learn all about all the different options, to get this to run even faster.