Google Compute Engine healthcheck initial delay appears to be ignored


In our project, we have a gRPC healthcheck configured for a Google Compute Engine instance group. The configuration is similar to the following:

# ...


resource "google_compute_health_check" "instances_health_check" {
  name = "instances-health-check"

  timeout_sec = 5
  check_interval_sec = 10

  grpc_health_check {
    port = "8087"
  }
}

# Generates instance group for each GCP zone in `local.zone_to_region`.
resource "google_compute_instance_group_manager" "instances" {
  for_each = local.zone_to_region

  base_instance_name = "instance-${each.key}"
  name = "group-${each.key}"
  target_pools = []
  target_size = 0
  wait_for_instances = false
  zone = each.key

  version {
    instance_template = module.instance_templates[each.value[0]].template.self_link
  }

  auto_healing_policies {
    health_check = google_compute_health_check.instances_health_check.id
    initial_delay_sec = 600
  }
}

# ...

When a new instance is spun up, it receives its first healthcheck request 1-2 minutes after it starts, which is not enough time for our app to become operational. The configured initial_delay_sec seems to be ignored.

This leads to multiple undesirable warnings emitted to our logs. The warnings have this format:

{
    "@type":"type.googleapis.com/compute.InstanceGroupManagerEvent",
    "instanceHealthStateChange":{
        "detailedHealthState": "UNHEALTHY",
        // ...
        "previousDetailedHealthState": "UNKNOWN"
    }
}

Any way around this?


Solution

  • The initial_delay_sec parameter configures autohealing on the MIG (managed instance group), not the health check itself.

    It works as follows:

    1. The instance starts being probed almost immediately after the MIG creates it. Make sure your health check endpoint does not report success until your application is ready.
    2. The MIG uses the health check results to autoheal instances that are not healthy. This is where initial_delay_sec comes into play: unhealthy results are ignored for this period. If the VM is still unhealthy after that, it is repaired (i.e. recreated).
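
To make the two steps concrete, here is a minimal sketch in Python of the decision described above. It is a toy model, not how Compute Engine is actually implemented: the `Probe` type, the `should_repair` function, the probe timeline, and the exact boundary handling at the end of the delay are all assumptions made for illustration. Only the value 600 comes from the question's `initial_delay_sec`.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One health check result for a single VM (hypothetical type)."""
    seconds_since_start: int  # time elapsed since the MIG created the VM
    healthy: bool             # result reported by the health check


def should_repair(probe: Probe, initial_delay_sec: int) -> bool:
    """Return True if this toy model would recreate the instance."""
    if probe.healthy:
        return False
    # Probing happens regardless of the delay; only the repair decision
    # ignores unhealthy results that arrive inside the initial delay window.
    # Treating the boundary as inclusive is an assumption of this sketch.
    return probe.seconds_since_start >= initial_delay_sec


INITIAL_DELAY_SEC = 600  # same value as initial_delay_sec in the question

# Hypothetical timeline: the VM is probed early, the app becomes ready
# later, and the app fails again after the delay has passed.
timeline = [
    Probe(90, False),   # probed before the app is ready: ignored
    Probe(120, False),  # still inside the delay window: ignored
    Probe(300, True),   # app is up: nothing to repair
    Probe(700, False),  # unhealthy after the delay: repair
]

decisions = [should_repair(p, INITIAL_DELAY_SEC) for p in timeline]
print(decisions)
```

In this model the early unhealthy probes are still observed, which is consistent with the UNKNOWN to UNHEALTHY state-change events in the question; they just do not trigger a repair until the delay has elapsed.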