
Eureka Client not reflecting manual status of OUT_OF_SERVICE


I have a manager application that uses Eureka to discover worker applications. Both are using Spring Cloud Netflix and the auto configurations that they provide to do service registration and discovery.

Occasionally the manager marks an instance as OUT_OF_SERVICE and some time later (on the order of minutes) marks the same instance as UP.

The manager discovers instances using the CloudEurekaClient, and then sets its status:

@Autowired
private CloudEurekaClient cloudEurekaClient;

...

InstanceInfo instance = cloudEurekaClient.getNextServerFromEureka(WORKER_SERVICE_NAME, false);
cloudEurekaClient.setStatus(InstanceInfo.InstanceStatus.OUT_OF_SERVICE, instance);
// do some work
cloudEurekaClient.setStatus(InstanceInfo.InstanceStatus.UP, instance);

This seems to work well. The Eureka server status page shows my instances going from UP to OUT_OF_SERVICE.

However, the CloudEurekaClient doesn't seem to know that an instance is OUT_OF_SERVICE. Instead, using the debugger, I found that the instance has a status of UP and an overriddenStatus of UNKNOWN.

Note: If I call cloudEurekaClient.getApplication("worker").getInstances(), it returns the 4 UP instances but not the one that is OUT_OF_SERVICE.

Is this expected? I assumed the Eureka client would know that an instance is OUT_OF_SERVICE, but that's not the behavior I'm seeing. This causes problems for a health indicator I have that uses the CloudEurekaClient to report the number of UP and OUT_OF_SERVICE instances.


Solution

  • After some digging, the issue appears to be that setting an instance's status makes an immediate call to the Eureka server, which is why the server status UI shows the correct status in real time:

    public void setStatus(InstanceStatus newStatus, InstanceInfo info) {
        getEurekaHttpClient().statusUpdate(info.getAppName(), info.getId(), newStatus, info);
    }
    

    However, CloudEurekaClient.getNextServerFromEureka() reads from a local cache of the registry, which is refreshed only periodically on a timer whose interval is defined by EurekaClientConfig.getRegistryFetchIntervalSeconds().

    So I'm hitting a race condition: if I set an instance's status to OUT_OF_SERVICE and query the discovery client for an application before the cache is refreshed, the client and server have different views of the instances. If I wait at least the registry fetch interval (getRegistryFetchIntervalSeconds() seconds) before asking the client for the next server, it correctly ignores the instance that I manually put into OUT_OF_SERVICE status.
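
The race described above can be reproduced with a small, self-contained model. This is only a sketch of the behavior, not Eureka's implementation: Server, Client, refreshCache() and upInstances() are hypothetical names invented for illustration, and the timer-driven fetch is replaced by an explicit refreshCache() call so the ordering is easy to follow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the stale-cache race; none of these types are Eureka classes.
class StaleRegistryCacheDemo {

    enum Status { UP, OUT_OF_SERVICE }

    // Hypothetical stand-in for the Eureka server: always holds the latest status.
    static class Server {
        private final Map<String, Status> statuses = new LinkedHashMap<>();

        void register(String instanceId) {
            statuses.put(instanceId, Status.UP);
        }

        void statusUpdate(String instanceId, Status newStatus) {
            statuses.put(instanceId, newStatus);
        }

        // Returns a copy, so later server-side changes do not leak into old snapshots.
        Map<String, Status> snapshot() {
            return new LinkedHashMap<>(statuses);
        }
    }

    // Hypothetical stand-in for the client: writes go straight to the server,
    // reads are answered from a locally cached snapshot.
    static class Client {
        private final Server server;
        private Map<String, Status> cache;

        Client(Server server) {
            this.server = server;
            this.cache = server.snapshot();
        }

        // Mirrors the shape of setStatus(): an immediate server call, no cache update.
        void setStatus(String instanceId, Status newStatus) {
            server.statusUpdate(instanceId, newStatus);
        }

        // Models the periodic registry fetch; called explicitly here instead of by a timer.
        void refreshCache() {
            cache = server.snapshot();
        }

        // Lists the instances the client currently believes are UP.
        List<String> upInstances() {
            List<String> up = new ArrayList<>();
            for (Map.Entry<String, Status> entry : cache.entrySet()) {
                if (entry.getValue() == Status.UP) {
                    up.add(entry.getKey());
                }
            }
            return up;
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        server.register("worker-1");
        server.register("worker-2");
        Client client = new Client(server);

        client.setStatus("worker-1", Status.OUT_OF_SERVICE);

        // The server already knows, but the client still answers from its old snapshot.
        System.out.println("before refresh: " + client.upInstances()); // [worker-1, worker-2]

        client.refreshCache();
        System.out.println("after refresh:  " + client.upInstances()); // [worker-2]
    }
}
```

In a real application the refresh is driven by the registry fetch timer, so the options are either to wait out that interval after a status change or to shorten it. In Spring Cloud Netflix the interval is exposed as eureka.client.registry-fetch-interval-seconds (30 seconds by default), at the cost of more frequent calls to the Eureka server.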