autoscaling, azure-cognitive-search, azure-search-.net-sdk

Programmatic scaling of Azure Search indexers


I have Cosmos DB collections which are indexed by the standard Azure Search indexer + datasource pairs, using WHERE _ts > @HighWaterMark in the query, as recommended in the docs.

From time to time I need to scale the indexers up/down from 1 to N to speed up the indexing process.

For static scaling I can create N datasource + indexer pairs, each processing a separate partition or subset of items defined by the query, e.g. WHERE indexingGroup = <1..N> AND _ts >= @HighWaterMark
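
To make that concrete, the following is a minimal sketch of how such pairs could be created through the Azure Cognitive Search REST API. The service name, keys, index/collection names, the indexingGroup field and the api-version are placeholders or assumptions, not a complete implementation:

    # Rough sketch of the static setup: one datasource + indexer pair per indexingGroup,
    # created via the Azure Cognitive Search REST API. All names, keys and the
    # api-version below are placeholders.
    import requests

    SERVICE = "https://my-search-service.search.windows.net"      # placeholder
    API_VERSION = "2019-05-06"                                     # assumed version
    HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
    COSMOS_CONN = "AccountEndpoint=https://...;AccountKey=...;Database=mydb"  # placeholder

    def create_pair(group: int) -> None:
        """Create a datasource + indexer pair that indexes one indexingGroup."""
        ds_name = f"items-ds-{group}"
        datasource = {
            "name": ds_name,
            "type": "cosmosdb",
            "credentials": {"connectionString": COSMOS_CONN},
            "container": {
                "name": "items",
                # Partitioned incremental query: only this group, only new/changed items.
                "query": (f"SELECT * FROM c WHERE c.indexingGroup = {group} "
                          "AND c._ts >= @HighWaterMark ORDER BY c._ts"),
            },
            "dataChangeDetectionPolicy": {
                "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
                "highWaterMarkColumnName": "_ts",
            },
        }
        indexer = {
            "name": f"items-indexer-{group}",
            "dataSourceName": ds_name,
            "targetIndexName": "items-index",
        }
        requests.put(f"{SERVICE}/datasources/{ds_name}?api-version={API_VERSION}",
                     headers=HEADERS, json=datasource).raise_for_status()
        requests.put(f"{SERVICE}/indexers/{indexer['name']}?api-version={API_VERSION}",
                     headers=HEADERS, json=indexer).raise_for_status()

    # Static scaling: one pair per indexingGroup 1..N.
    for g in range(1, 4):
        create_pair(g)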

But now I need to scale such pairs dynamically. For example, I have 1 indexer and want to create 1 more: I need to update the query of the 1st pair to add WHERE indexingGroup = 1, and create a new indexer + datasource that will process the second subset with WHERE indexingGroup = 2.

As a result, the 1st pair will, I assume, continue processing using its HighWaterMark from the previous execution, while the new 2nd pair will start from scratch because its HighWaterMark is 0.

Is there any way to get the current HighWaterMark value from one datasource/indexer and then set it on another?

UPD.1. Scenario

  1. We have hundreds of millions of records of different types. Each type has its own indexer (group). Sometimes we get a huge amount of new data, so we need to scale up. Because Azure Search has a limit on parallel indexers (and it is quite low), in our tests we found that some indexers never start because older ones do not stop for 24h. So the idea is to be able to balance the indexer count programmatically.

  2. As we ran into this only recently, we are currently experimenting with different numbers of indexers. In our current approach we use the ID as the partition key, so there are no dedicated indexers per partition(s).

  3. One of the infrequent (monthly+) scenarios is to index 200M+ items in a limited amount of time. For this we need to add the maximum number of indexers, complete the indexing, and then scale down. Beyond that, we have daily batches of 10-20M records at about 3M items/h per indexer. For other types we have a realtime stream of data to be processed (Cosmos DB upsert throughput is 10-100K). So the main balancing is between this big block of data and the streaming load. We also have a few very small indexers which should complete in a minimal amount of time (near realtime in terms of Cosmos/Search SLA capabilities).


Solution

  • You can get the high water mark value from the last completed run of an indexer from the finalTrackingState on the Indexer Execution Result. This value can only be cleared via Indexer Reset and cannot be set to a specific value. However, you can achieve the same effect as running from a specific high water mark by creating or resetting an indexer and then changing the datasource query to also include the high water mark value, such as:

    WHERE indexingGroup = <1..N> AND _ts >= @HighWaterMark AND _ts >= _LiteralAsCInt64(1579295473)
    

    If you do this, remember to remove this value from the query when you reset the indexer if you want it to start from the beginning. Also, when scaling down, be sure to use the minimum finalTrackingState across the indexers to ensure that you don't miss any documents.
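
    As a rough illustration, a minimal sketch of that flow against the REST API follows: read the finalTrackingState of the last run, reset the target indexer, and bake the value into its datasource query as a literal. The service/indexer/datasource names, the indexingGroup query shape from the question, the connection string and the api-version are all assumptions, not a definitive implementation:

        # Sketch: transfer a high water mark to another indexer by resetting it and
        # embedding the value as a literal in its datasource query. All names, the
        # api-version and the connection string are placeholders/assumptions.
        import requests

        SERVICE = "https://my-search-service.search.windows.net"   # placeholder
        API_VERSION = "2019-05-06"                                  # assumed version
        HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
        COSMOS_CONN = "AccountEndpoint=https://...;AccountKey=...;Database=mydb"  # placeholder

        def get_final_tracking_state(indexer):
            """Return finalTrackingState of the indexer's last run (the high water mark)."""
            r = requests.get(f"{SERVICE}/indexers/{indexer}/status?api-version={API_VERSION}",
                             headers=HEADERS)
            r.raise_for_status()
            return (r.json().get("lastResult") or {}).get("finalTrackingState")

        def reseed(indexer, datasource, group, high_water_mark):
            """Reset the indexer, then give its datasource a query with a literal _ts floor."""
            requests.post(f"{SERVICE}/indexers/{indexer}/reset?api-version={API_VERSION}",
                          headers=HEADERS).raise_for_status()
            ds = {
                "name": datasource,
                "type": "cosmosdb",
                "credentials": {"connectionString": COSMOS_CONN},
                "container": {
                    "name": "items",
                    # Continue from the transferred high water mark even though the
                    # indexer's own tracking state was just cleared by the reset.
                    "query": (f"SELECT * FROM c WHERE c.indexingGroup = {group} "
                              f"AND c._ts >= @HighWaterMark AND c._ts >= {high_water_mark} "
                              "ORDER BY c._ts"),
                },
            }
            requests.put(f"{SERVICE}/datasources/{datasource}?api-version={API_VERSION}",
                         headers=HEADERS, json=ds).raise_for_status()

        # Example: seed a newly created 2nd pair from the 1st pair's progress.
        hwm = get_final_tracking_state("items-indexer-1")
        if hwm is not None:
            reseed("items-indexer-2", "items-ds-2", group=2, high_water_mark=hwm)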

    I am on the Azure Cognitive Search team and would like to learn more about your scenario. A few questions:

    1. Why do you need to dynamically scale the indexers? (rather than always using partitioned indexers)
    2. How do you determine the value of indexingGroup? (partition the data)
    3. What kind of indexing throughput do you need for your scenario? (upper bound on number of partitioned indexers)