How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions?

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.

What considerations should be made when selecting values for these timeouts?

Solution

As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:

ScheduleToStartTimeout: This configuration specifies the maximum duration between the time the activity is scheduled by a workflow and the time it’s picked up by an activity worker to start executing it. In other words, it limits the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by an activity worker from the time it fetches a task until it reports the completion of it to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end timeout duration for an activity from the time it is scheduled by the workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this configuration basically specifies the maximum duration the Cadence server would wait for a heartbeat before assuming the activity worker has failed.

How to select a proper timeout value

Picking the StartToCloseTimeout is fairly straightforward when you know what it does. Essentially, you should make this long enough so that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker, such as the latency of your down-stream dependencies (e.g. services, networking etc.). On the other hand, you should aim to keep this value as small as is feasible to make your end-to-end system more responsive. If you can’t make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider using a HeartbeatTimeout config and implementing heartbeating in your activity.

ScheduleToStartTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it’s worth taking a moment to pay some extra attention to this configuration.

Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:

Reduced worker pool throughput due to deployments, maintenance or network-related issues.
Down-stream latency spikes that would increase the time it takes to complete each activity task, which then reduces the throughput of the worker pool.
A significant spike in the number of workflow instances that schedule the activity; especially if one of the upstream services is also an asynchronous queue/stream processor which can create its own backlog and suddenly start processing it at a very high-volume.

Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries would add more activity tasks to the queue and make the backlog harder to recover from, or even worse. On the other hand, there are many use cases where business requirements really limit the total time the system can take to process an activity. Therefore, it’s usually not a bad idea to aim for a high ScheduleToStartTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes or it might be perfectly fine to keep it there for several days before timing out.