I have a setup with Spring Boot Zuul as an external gateway and Eureka for service discovery, all running in Kubernetes.
The thing is, I would like to guarantee my service's availability, so when one of the instances of my service goes down, I expect Zuul to retry the call against one of the other instances, through Eureka.
I tried to achieve this by following this post by Ryan Baxter. I also tried to follow the tips from here.
The problem is that whatever I do, Zuul does not seem to retry the call. When I take down one of my instances, it keeps returning a timeout for that instance until the Eureka addresses get synchronized.
My application.yaml looks like this:
spring:
  cloud:
    loadbalancer:
      retry:
        enabled: true

zuul:
  stripPrefix: true
  ignoredServices: '*'
  routes:
    my-service:
      path: /my-service/**
      serviceId: my-service-api
      retryable: true

my-service:
  ribbon:
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 3
    OkToRetryOnAllOperations: true
    ReadTimeout: 5000
    ConnectTimeout: 3000
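One thing I am not sure about: as far as I understand, Ribbon per-client settings are keyed by the Ribbon client name, and for a Zuul route that is the serviceId rather than the route name. If that applies here, the retry section would need to look like this instead (same values, keyed by the serviceId):

my-service-api:
  ribbon:
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 3
    OkToRetryOnAllOperations: true
    ReadTimeout: 5000
    ConnectTimeout: 3000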
My service is using Camden SR7 (I also tried SR6):
"org.springframework.cloud:spring-cloud-dependencies:Camden.SR7"
And also Spring-retry:
org.springframework.retry:spring-retry:1.1.5.RELEASE
My application class looks like this:
@SpringBootApplication
@EnableEurekaClient
@EnableZuulProxy
@EnableRetry
public class MyZuulApplication {

    public static void main(String[] args) {
        SpringApplication.run(MyZuulApplication.class, args);
    }
}
EDIT:
Making a GET request through Postman returns:
{
    "timestamp": 1497959364819,
    "status": 500,
    "error": "Internal Server Error",
    "exception": "com.netflix.zuul.exception.ZuulException",
    "message": "TIMEOUT"
}
Taking a look at the Zuul logs, it printed:

{"level":"WARN","logger_name":"org.springframework.cloud.netflix.zuul.filters.post.SendErrorFilter","appName":...,"message":"Error during filtering","stack_trace":"com.netflix.zuul.exception.ZuulException: Forwarding error [... Stack Trace ...] Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: my-service-api timed-out and no fallback available [... Stack Trace ...] Caused by: java.util.concurrent.TimeoutException: null
Another interesting log entry that I found:
{"level":"INFO" [...] current list of Servers=[ip_address1:port, ip_address2:port, ip_address3:port],Load balancer stats=Zone stats: {defaultzone=[Zone:[ ... ]; Instance count:3; Active connections count: 0; Circuit breaker tripped count: 0; Active connections per server: 0.0;]
},Server stats: [[Server:ip_address1:port; [ ... ] Total Requests:0; Successive connection failure:0; Total blackout seconds:0; [ ... ]
, [Server:ip_address2:port; [ ... ] Total Requests:0; Successive connection failure:0; Total blackout seconds:0; [ ... ]
, [Server:ip_address3:port; [ ... ] Total Requests:0; Successive connection failure:0; Total blackout seconds:0; [ ... ]
The problem seems to be caused by the Hystrix timeout. The default timeout of a HystrixCommand is 1000 ms, and that is not enough for Ribbon to retry the HTTP request. Try increasing the Hystrix timeout like the following:
hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 20000
This increases the timeout of every Hystrix command to 20 seconds. If it works, adjust the value for your environment. Note that you are using quite big values for the read and connect timeouts, so you need to size the Hystrix timeout against those values (and against the retry counts) as well.
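For reference, a common rule of thumb (my assumption here, not something your posts guarantee) is that the Hystrix timeout should cover the worst-case Ribbon time, roughly (ConnectTimeout + ReadTimeout) × (MaxAutoRetries + 1) × (MaxAutoRetriesNextServer + 1). With your current values that is (3000 + 5000) × 4 × 4 = 128000 ms, far above 20 seconds, so you would either raise the Hystrix timeout further or reduce the retry counts and per-request timeouts. A sketch of a combination that fits inside a 25-second Hystrix budget, assuming the Ribbon client is keyed by the serviceId (my-service-api):

# worst case = (1000 + 3000) * (1 + 1) * (2 + 1) = 24000 ms
my-service-api:
  ribbon:
    MaxAutoRetries: 1
    MaxAutoRetriesNextServer: 2
    OkToRetryOnAllOperations: true
    ConnectTimeout: 1000
    ReadTimeout: 3000

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 25000

If you prefer, the timeout can be scoped to just this route by replacing default with the Hystrix command key, which for Zuul is the serviceId (my-service-api).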