
Performance decays exponentially when inserting bulk data into a Grails app


We need to seed an application with 3 million entities before running performance tests. The 3 million entities should be loaded through the application to simulate 3 years of real data.

We insert between 1 and 5000 entities at a time. At first the response times are very good, but after a while they degrade exponentially.

We use a Groovy script to hit a URL that starts each round of insertions.

  1. Restarting the application resets the response time, i.e. it fixes the problem temporarily.
  2. Rerunning the script without restarting the app has no effect.

We use the following to improve performance:

1) Clean up GORM after every 100 insertions:

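// Flush pending inserts to the database, then evict them from the Hibernate session cache
def session = sessionFactory.currentSession
session.flush()
session.clear()
// Clear the per-thread map Grails uses to track domain instance properties
DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP.get().clear()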

(old Ted Naleid trick: http://naleid.com/blog/2009/10/01/batch-import-performance-with-grails-and-mysql)

2) We use GPars for parallel insertions:

import groovyx.gpars.GParsPool

// Insert 1000 entities in parallel on a GPars thread pool
GParsPool.withPool {
    (0..<1000).eachParallel {
        def entity = new Entity(...)
        insertionService.insert(entity)
    }
}

Notes

  • Looking at the log output, I've noticed that the processing time for each entity stays the same, but the system seems to pause longer and longer between iterations.
  • The exact number of entities inserted is not important, just around 3 million, so if some fail we can ignore it.
  • Tuning the number of entities inserted at a time has little or no effect.

Help

I'm really hoping somebody has a good idea on how to fix the problem.

Environment

  • Grails: 2.4.2 (GRAILS_OPTS=-Xmx2G -Xms512m -XX:MaxPermSize=512m)
  • Java: 1.7.0_55
  • MBP: OS X 10.9.5 (2.6 GHz Intel Core i7, 16 GB 1600 MHz DDR3)

Solution

  • The pausing makes me think the JVM is doing garbage collection. Have you used a profiler such as VisualVM to see how much time is being spent on garbage collection? That is typically the best way to understand what is happening with your application inside the JVM.

    Also, if you are only trying to "seed" the application, it is far better performance-wise to load the data directly into the database rather than through your application.

    (Added as answer per comment)
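
A quick way to check the garbage-collection theory without attaching a profiler is to read the JVM's own GC counters around one round of insertions. The following is only a minimal sketch, not a replacement for VisualVM: the helper names `totalGcMillis` and `timeWithGc` are made up for illustration, and the allocation loop merely stands in for the real `insertionService.insert(...)` calls.

```groovy
import java.lang.management.ManagementFactory

// Total time in ms the JVM has spent in garbage collection so far,
// summed over all collectors. A collector reports -1 when the value
// is undefined, so those entries are skipped.
long totalGcMillis() {
    ManagementFactory.garbageCollectorMXBeans
        .collect { it.collectionTime }
        .findAll { it >= 0 }
        .sum(0L) as long
}

// Runs the given closure and returns [wall-clock ms, GC ms] for it.
List<Long> timeWithGc(Closure work) {
    long gcBefore = totalGcMillis()
    long start = System.currentTimeMillis()
    work()
    long elapsed = System.currentTimeMillis() - start
    [elapsed, totalGcMillis() - gcBefore]
}

// Hypothetical usage: wrap one round of insertions.
def (elapsed, gc) = timeWithGc {
    // Placeholder workload; the real insert calls would go here.
    (1..100000).collect { new byte[64] }
}
println "round took ${elapsed} ms, of which about ${gc} ms was GC"
```

If the GC share grows from round to round while the per-entity time stays flat, that would be consistent with the pauses described in the question; if it stays small, the cause is probably elsewhere.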