In our project, we have a gRPC healthcheck configured for a Google Compute Engine instance group. The configuration is similar to the following:
# ...
resource "google_compute_health_check" "instances_health_check" {
  name               = "instances-health-check"
  timeout_sec        = 5
  check_interval_sec = 10

  grpc_health_check {
    port = "8087"
  }
}
# Generates instance group for each GCP zone in `local.zone_to_region`.
resource "google_compute_instance_group_manager" "instances" {
  for_each           = local.zone_to_region
  base_instance_name = "instance-${each.key}"
  name               = "group-${each.key}"
  target_pools       = []
  target_size        = 0
  wait_for_instances = false
  zone               = each.key

  version {
    instance_template = module.instance_templates[each.value[0]].template.self_link
  }

  auto_healing_policies {
    # Refers to the health check resource declared above.
    health_check      = google_compute_health_check.instances_health_check.id
    initial_delay_sec = 600
  }
}
# ...
When a new instance is spun up, it receives its first healthcheck request within 1-2 minutes of starting, which is not enough time for our app to become operational. The configured initial_delay_sec seems to be ignored.
This leads to multiple undesirable warnings emitted to our logs. The warnings have this format:
{
  "@type": "type.googleapis.com/compute.InstanceGroupManagerEvent",
  "instanceHealthStateChange": {
    detailedHealthState: "UNHEALTHY",
    // ...
    previousDetailedHealthState: "UNKNOWN"
  }
}
Any way around this?
The initial_delay_sec parameter configures autohealing on the MIG (managed instance group), not the healthcheck itself.
The healthcheck starts probing the VM regardless of this setting, which is why you still see the health state change in the logs. The initial delay only affects what autohealing does with the results: unhealthy results are ignored for this period. If the VM is still unhealthy after that, it is repaired (i.e. recreated).
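To make the split concrete, here is an annotated sketch based on the configuration from the question. The resource names and values are the ones from the question; `unhealthy_threshold` is an optional argument of `google_compute_health_check` that the question does not set, shown here only to illustrate which knobs belong to the probe. Treat it as an illustration of where each setting lives, not as a recommended configuration.

```hcl
# Probe side: controls how and how often the instance is checked.
# Nothing in this resource knows about initial_delay_sec.
resource "google_compute_health_check" "instances_health_check" {
  name               = "instances-health-check"
  timeout_sec        = 5
  check_interval_sec = 10

  # Assumption for illustration: number of consecutive failed probes
  # before the instance is reported as UNHEALTHY (not set in the question).
  unhealthy_threshold = 2

  grpc_health_check {
    port = "8087"
  }
}

# Autohealing side: controls what the MIG does with the probe results.
resource "google_compute_instance_group_manager" "instances" {
  # ...

  auto_healing_policies {
    health_check = google_compute_health_check.instances_health_check.id

    # Unhealthy results are ignored by autohealing for this long after
    # the VM starts; the probes themselves are not delayed.
    initial_delay_sec = 600
  }
}
```

With this layout, an UNHEALTHY state can be reported during the first 600 seconds, but autohealing should not recreate the VM because of it until the delay has passed.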