We are currently using the default settings of the MassIndexer (10 objects to load per query, per thread) with 7 threads to reindex data from one table (8-10 fields) into Elasticsearch. The table currently holds 25 million rows and will grow to a few hundred million.
MassIndexer indexer = searchSession.massIndexer(Entity.class)
        .threadsToLoadObjects(7);

indexer.start()
        .thenRun(() -> log.info("Mass Indexing Entity Complete"))
        .exceptionally(throwable -> {
            log.error("Mass Indexing Entity Failed", throwable);
            return null;
        });
The database is Postgres on RDS, and we are using AWS Elasticsearch. The Hibernate Search version is 6.
Recently we hit a bottleneck during the reindexing process: it ran for hours with 20 million rows in the table. One of the reasons was that we had a connection pool with a maximum of 10 connections. With the current mass indexer setup (7 threads), only 2 connections were left for other operations (1 for ID lookup + 7 for entity lookup = 8), causing timeouts while waiting for a connection. We will increase the pool size to 20 and test.
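For reference, this is roughly how we plan to size the pool relative to the mass indexer (a sketch assuming HikariCP; the split and the numbers below are our own estimates, not anything prescribed by Hibernate Search):

// Sketch only: size the pool so mass indexing cannot starve regular traffic.
// Assumes HikariCP; the breakdown below is an assumption, not a recommendation.
HikariConfig config = new HikariConfig();

int entityLoadingThreads = 7;   // threadsToLoadObjects: one connection each
int idScrollConnections = 1;    // one connection scrolls the IDs
int applicationHeadroom = 12;   // connections left for normal requests

config.setMaximumPoolSize(entityLoadingThreads + idScrollConnections + applicationHeadroom); // 20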
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings? Or should we look at other strategies? What has worked in the past for someone with the same requirements?
UPDATE: Also, it looks like the IDLoader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query? And what is the use of idFetchSize? It looks like it is not used in the indexing process.
What is the best strategy to reindex very large datasets? Can MassIndexer scale to this high volume with some configuration settings?
With that many entities, things are definitely going to take more than just a few minutes.
Whether it can scale... the thing is, the mass indexer is just a middleman between your database and Elasticsearch. Assuming your database scales, and Elasticsearch scales, then the only thing required for the mass indexer to scale is to do more work in parallel. And you can control that.
Now, you probably meant "can it reindex in a satisfying amount of time", and that of course will depend on what your expectations are, as well as how much effort you put into tuning it.
The performance of mass indexing will be affected by the configuration you pass to the mass indexer, of course, but also by the schema and data of your entities, your RDBMS and its configuration, your Elasticsearch cluster and its configuration, the machines they run on, ... Really, no one knows what's possible: the only way to know is to try, assess the results, tune, and iterate.
I'd advise you to first concentrate on addressing lazy loading issues, since those will have a tremendous impact on performance: be sure to set hibernate.default_batch_fetch_size in order to reduce the impact of lazy loading.
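For example, if you bootstrap through plain JPA, that could look like the sketch below (the persistence unit name and the value 100 are placeholders to adjust; with Spring Boot you would set the same property in your configuration files instead):

// Sketch: enable batch fetching of lazy associations so the entity-loading
// threads don't issue one SELECT per lazy association per entity.
// "my-persistence-unit" and the value 100 are placeholders.
Map<String, Object> properties = new HashMap<>();
properties.put("hibernate.default_batch_fetch_size", "100");

EntityManagerFactory entityManagerFactory =
        Persistence.createEntityManagerFactory("my-persistence-unit", properties);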
Then, I can't do much more than repeat what the reference documentation says:
The MassIndexer was designed to finish the re-indexing task as quickly as possible, but there is no one-size-fits-all solution, so some configuration is required to get the best of it. Performance optimization can get quite complex, so keep the following in mind while you attempt to configure the MassIndexer:
- Always test your changes to assess their actual effect: advice provided in this section is true in general, but each application and environment is different, and some options, when combined, may produce unexpected results.
- Take baby steps: before tuning mass indexing with 40 indexed entity types with two million instances each, try a more reasonable scenario with only one entity type, optionally limiting the number of entities to index to assess performance more quickly.
- Tune your entity types individually before you try to tune a mass indexing operation that indexes multiple entity types in parallel.
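To make that concrete, here is a sketch of the main knobs you would iterate on; every number below is a placeholder to experiment with, not a recommendation:

// Sketch: the main MassIndexer tuning knobs. All values are placeholders.
searchSession.massIndexer(Entity.class)
        .typesToIndexInParallel(1)    // tune one entity type at a time first
        .threadsToLoadObjects(7)      // entity-loading threads; each needs its own DB connection
        .batchSizeToLoadObjects(10)   // entities loaded per query, per thread
        .idFetchSize(150)             // JDBC fetch size for the ID scroll (see below)
        .start();                     // returns a CompletionStage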
Beyond tuning the mass indexer, remember that it only loads data from the database to push it to Elasticsearch. So sure, the mass indexer might be the bottleneck, but so could be the database or Elasticsearch, if they are under-dimensioned. Make sure that both can provide satisfying throughput as well: decent machines, clustering if necessary, server-side configuration, ...
Anyway, there are many things you can do; before you do, try to find out what the bottleneck is. Is your database always at 100% CPU? Then tune your database: change settings, use a beefier machine, ... Is Elasticsearch I/O clearly reaching its limits? Then tune Elasticsearch: change settings, add more nodes, ... Are both PostgreSQL and Elasticsearch doing just fine? Then maybe you need even more DB connections, or more ES connections, or more threads in your mass indexer. Or maybe it's something else; performance is hard.
Or should we look at other strategies?
I would leave that as a last resort. If you don't understand what exactly is wrong with the performance of the mass indexer, then you're unlikely to find a better solution.
If you don't trust the MassIndexer to do a good job, you can try doing it yourself. Set up a thread that loads IDs, and other threads that load the corresponding entities, then index them manually. That's not exactly simple to get right, but it's possible.
If you do just that, I doubt you will improve anything. But, assuming entity loading is the bottleneck, and not indexing (you must check that first!), I imagine that you could get better throughput by leveraging the specifics of your database.
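If you do go down that road, a bare-bones worker for such a do-it-yourself loop might look like this sketch (imports omitted; the ID query, batch size and threading around it are yours to design, and none of this is how the MassIndexer is actually implemented):

// Sketch: index one batch of IDs in one transaction, through the indexing plan.
// The producer thread that scrolls IDs and fills the batches is not shown.
void indexBatch(EntityManagerFactory entityManagerFactory, List<Long> ids) {
    EntityManager entityManager = entityManagerFactory.createEntityManager();
    try {
        entityManager.getTransaction().begin();
        List<Entity> entities = entityManager
                .createQuery("select e from Entity e where e.id in :ids", Entity.class)
                .setParameter("ids", ids)
                .getResultList();
        SearchSession searchSession = Search.session(entityManager);
        for (Entity entity : entities) {
            searchSession.indexingPlan().addOrUpdate(entity); // sent to Elasticsearch on commit
        }
        entityManager.getTransaction().commit();
    } finally {
        entityManager.close();
    }
}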
UPDATE: Also, it looks like the IDLoader thread is not batched, so for 50 million rows, will it load all 50 million IDs into memory in one query?
No, IDs are loaded in batches. Then each batch is pushed to an internal queue, and consumed by a loading thread. The size of batches is controlled by batchSizeToLoadObjects.
The one exception is MySQL, whose default configuration is to load the whole result set in memory (don't ask me why), but that doesn't affect PostgreSQL. And anyway, that can be fixed (see below).
More information about the parameters here.
And what is the use of idFetchSize? It looks like it is not used in the indexing process.
This is the JDBC fetch size. IDs are retrieved using a scroll (cursor), and the JDBC fetch size is the size of result pages (~ low-level buffers) for this scroll in your JDBC driver.
To be honest, it's mostly useful for MySQL (and perhaps MariaDB?), whose JDBC driver will load all results in memory even if we're using a cursor, unless the fetch size is set to Integer#MIN_VALUE. I know, it's weird.
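So on PostgreSQL you can usually leave it at the default or pick a moderate value; the MySQL workaround mentioned above would look like this (a sketch, only relevant if you ever run against MySQL):

// Sketch: idFetchSize only affects the JDBC buffering of the ID scroll.
// On PostgreSQL the default (or any moderate value) is fine; on MySQL,
// Integer.MIN_VALUE tells the driver to stream results instead of
// loading them all into memory.
searchSession.massIndexer(Entity.class)
        .idFetchSize(Integer.MIN_VALUE) // MySQL streaming workaround
        .threadsToLoadObjects(7)
        .start();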