I have a setup with Spring Boot Zuul as an external gateway and Eureka for service discovery, all running in Kubernetes.
The thing is, I would like to guarantee my service's availability, so when one of the instances of my service goes down, I expect Zuul to retry the call against one of the other instances, through Eureka.
I tried to achieve this by following this post by Ryan Baxter. I also tried to follow the tips from here.
The problem is that whatever I do, Zuul does not seem to retry the call. When I take down one of my instances, it keeps returning a timeout for that instance until the Eureka addresses get synchronized.
My application.yaml looks like this:
spring:
  cloud:
    loadbalancer:
      retry:
        enabled: true

zuul:
  stripPrefix: true
  ignoredServices: '*'
  routes:
    my-service:
      path: /my-service/**
      serviceId: my-service-api
      retryable: true

my-service:
  ribbon:
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 3
    OkToRetryOnAllOperations: true
    ReadTimeout: 5000
    ConnectTimeout: 3000
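One thing I am not sure about: as far as I understand, Ribbon per-client settings are keyed by the Ribbon client name, and for a Zuul route that is the serviceId rather than the route name. If that applies here, the retry section would need to look like this instead (same values, keyed by the serviceId):

my-service-api:
  ribbon:
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 3
    OkToRetryOnAllOperations: true
    ReadTimeout: 5000
    ConnectTimeout: 3000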
My service is using Camden SR7 (I also tried SR6):
"org.springframework.cloud:spring-cloud-dependencies:Camden.SR7"
And also Spring-retry:
org.springframework.retry:spring-retry:1.1.5.RELEASE
My application class looks like this:
@SpringBootApplication
@EnableEurekaClient
@EnableZuulProxy
@EnableRetry
public class MyZuulApplication {

    public static void main(String[] args) {
        SpringApplication.run(MyZuulApplication.class, args);
    }
}
EDIT:
Making a GET request through Postman returns:
{
    "timestamp": 1497959364819,
    "status": 500,
    "error": "Internal Server Error",
    "exception": "com.netflix.zuul.exception.ZuulException",
    "message": "TIMEOUT"
}
Taking a look at the Zuul logs, it printed:

{"level":"WARN","logger_name":"org.springframework.cloud.netflix.zuul.filters.post.SendErrorFilter","appName":...,"message":"Error during filtering","stack_trace":"com.netflix.zuul.exception.ZuulException: Forwarding error [... Stack Trace ...] Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: my-service-api timed-out and no fallback available [... Stack Trace ...] Caused by: java.util.concurrent.TimeoutException: null
Another interesting log entry that I found:
{"level":"INFO" [...] current list of Servers=[ip_address1:port, ip_address2:port, ip_address3:port],Load balancer stats=Zone stats: {defaultzone=[Zone:[ ... ]; Instance count:3; Active connections count: 0; Circuit breaker tripped count: 0; Active connections per server: 0.0;]
},Server stats: [[Server:ip_address1:port; [ ... ] Total Requests:0; Successive connection failure:0; Total blackout seconds:0; [ ... ]
, [Server:ip_address2:port; [ ... ] Total Requests:0; Successive connection failure:0; Total blackout seconds:0; [ ... ]
, [Server:ip_address3:port; [ ... ] Total Requests:0; Successive connection failure:0; Total blackout seconds:0; [ ... ]
The problem seems to be caused by the Hystrix timeout. The default timeout of a HystrixCommand is 1000 ms, and that is not enough for Ribbon to retry the HTTP request. Try increasing the Hystrix timeout like the following:
hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 20000
This increases the timeout of every Hystrix command to 20 seconds. If it works, adjust the value for your environment. Note that you are using quite big values for the read and connect timeouts, so you need to size the Hystrix timeout against those values (and against the retry counts) as well.
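For reference, a common rule of thumb (my assumption here, not something your posts guarantee) is that the Hystrix timeout should cover the worst-case Ribbon time, roughly (ConnectTimeout + ReadTimeout) × (MaxAutoRetries + 1) × (MaxAutoRetriesNextServer + 1). With your current values that is (3000 + 5000) × 4 × 4 = 128000 ms, far above 20 seconds, so you would either raise the Hystrix timeout further or reduce the retry counts and per-request timeouts. A sketch of a combination that fits inside a 25-second Hystrix budget, assuming the Ribbon client is keyed by the serviceId (my-service-api):

# worst case = (1000 + 3000) * (1 + 1) * (2 + 1) = 24000 ms
my-service-api:
  ribbon:
    MaxAutoRetries: 1
    MaxAutoRetriesNextServer: 2
    OkToRetryOnAllOperations: true
    ConnectTimeout: 1000
    ReadTimeout: 3000

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 25000

If you prefer, the timeout can be scoped to just this route by replacing default with the Hystrix command key, which for Zuul is the serviceId (my-service-api).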