Azure Cosmos DB change feed processor options


Changefeed Processor options are well described here -

I have a few questions about them:

  1. leaseRenewInterval: Suppose an instance cannot renew its lease within 17s (the default lease renew interval). Will the lease be taken away from that instance right away, or will the processor wait until leaseExpirationInterval has elapsed, giving the instance a chance to reacquire the lease within 60s?

  2. Does lease renewal happen after a checkpoint by default, or are the two independent? That is, can lease renewal run on a separate thread every leaseRenewInterval while another thread is still working on a batch?

  3. We have seen the error "failed to checkpoint for owner 'null' with continuation token". How can this happen? Why would the owner become null?

  4. We have also seen LeaseLostException. Can this happen even if the pod/instance is not down? We do not expect any load balancing because there is only 1 physical partition, but we want the system to be fault tolerant, so we run multiple instances; all but one of them are always waiting to acquire the lease.

  5. In a few cases we see 3 pods/instances holding the lease for the same physical partition at the same time, i.e. they all acquired the same lease. (We have at most 1 physical partition: the document TTL is 3 days and storage is small, so we do not expect more than one.) How can this happen?

EDITS:

Current Settings:

leaseRenewInterval: 17s

leaseAcquireInterval: 13s

leaseExpirationInterval: 60s

feedPollDelay: 2s [the only non-default value]

ChangeFeed Processor version:

  • We are using the following dependency in our Maven build:
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-cosmos</artifactId>
            <version>4.8.0</version>
        </dependency>

So I assume the CFP version is 4.8.0.


Solution

    1. The current instance does not remove a lease it fails to renew. Other instances may conclude that the lease was not renewed because the current owner crashed, and "steal" it. This normally happens when the lease has not been accessed/updated before the expiration time.
    2. They are independent. Even if there are no checkpoints (no new changes), the lease still gets renewed.
    3. That sounds like the lease was released and a checkpoint was attempted afterwards. I am not sure which CFP version you are using or what your interval configuration is.
    4. Are you customizing any of the intervals? If so, that could lead to a lease being lost (detected as expired by another instance).
    5. Same question as before: this could happen either during load balancing or because leases are being detected as expired.
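
To make points 1, 3 and 4 more concrete, here is a toy Java model of a single lease. It is only a sketch under stated assumptions: the class `LeaseSketch` and all of its methods are hypothetical and are not part of azure-cosmos, and the real processor (which keeps leases in a lease container) may apply different rules. The 17s and 60s values are the defaults quoted in the question. The model shows that a missed renewal alone does not cost the owner its lease, that another host can take over once the expiration interval has passed, and that a checkpoint attempted by a host that no longer owns the lease fails.

```java
import java.time.Duration;
import java.time.Instant;

/** Toy model of one lease; hypothetical, NOT the azure-cosmos implementation. */
class LeaseSketch {
    // Default values quoted in the question.
    static final Duration RENEW_INTERVAL = Duration.ofSeconds(17);
    static final Duration EXPIRATION_INTERVAL = Duration.ofSeconds(60);

    String owner;         // null means nobody currently owns the lease
    Instant lastRenewed;  // last time the owner renewed the lease

    LeaseSketch(String owner, Instant now) {
        this.owner = owner;
        this.lastRenewed = now;
    }

    /** Assumption: a lease looks expired once it is unowned or unrenewed for longer than the expiration interval. */
    boolean isExpired(Instant now) {
        return owner == null
            || Duration.between(lastRenewed, now).compareTo(EXPIRATION_INTERVAL) > 0;
    }

    /** Renewal only succeeds while the caller is still the recorded owner. */
    boolean renew(String host, Instant now) {
        if (!host.equals(owner)) {
            return false;
        }
        lastRenewed = now;
        return true;
    }

    /** Another host takes over ("steals") only if the lease looks expired. */
    boolean tryAcquire(String host, Instant now) {
        if (!isExpired(now)) {
            return false;
        }
        owner = host;
        lastRenewed = now;
        return true;
    }

    /** The owner gives the lease up voluntarily; the owner field becomes null. */
    void release(String host) {
        if (host.equals(owner)) {
            owner = null;
        }
    }

    /** A checkpoint by a host that no longer owns the lease is rejected. */
    void checkpoint(String host, String continuationToken) {
        if (!host.equals(owner)) {
            throw new IllegalStateException(
                "failed to checkpoint for owner '" + owner + "' (caller was '" + host + "')");
        }
        // A real processor would persist continuationToken on the lease here.
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2021-01-01T00:00:00Z");
        LeaseSketch lease = new LeaseSketch("pod-a", t0);

        // pod-a missed its 17s renewal, but 30s is still inside the 60s window.
        System.out.println(lease.tryAcquire("pod-b", t0.plusSeconds(30))); // false
        // After more than 60s without renewal, pod-b can take the lease.
        System.out.println(lease.tryAcquire("pod-b", t0.plusSeconds(61))); // true
        // pod-a is still running, but its late renewal is now rejected.
        System.out.println(lease.renew("pod-a", t0.plusSeconds(62)));      // false
    }
}
```

In this model pod-a never went down; it merely failed to renew for longer than the expiration interval (for example because of a long pause or a network stall), which is enough for another instance to take the lease and for pod-a's next checkpoint to fail.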

    Please share which CFP version you are using and which options you have set. Unless you are very certain of what you are doing, I don't recommend changing any of the intervals.

    EDIT: Based on the new information: I am not familiar with the Java CFP, but when there are more instances than leases, load balancing a lease across the other instances, while not ideal, shouldn't be a problem, because the lease will still be owned and processed by 1 machine. The only recommendation I have is to try the latest Maven package version. There are CFP fixes in newer versions (https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sdk-java-v4#4140-2021-04-06), so try 4.15.0.