API Gateway making calls to Application Load Balancer endpoint in front of a micro service deployed in Amazon ECS

I have micro services deployed in an ECS Cluster with an Application load balancer and Target group configured as a front end to it.

Now, one issue that I am having with the application load balancer is that sometimes the response takes more than 3 seconds. I am trying to investigate what is going on with it.

Now, when I create a Resource and a POST method in API Gateway with the HTTP Endpoint configured as the Application Load balancer of the Service, what I am seeing is that in some cases it gives the following error:

Status: 504
Latency: 3026 ms
Response Body
{
  "message": "Network error communicating with endpoint"
}


Execution log for request test-request
Mon Feb 06 21:47:00 UTC 2017 : Starting execution for request: test-invoke-request
Mon Feb 06 21:47:00 UTC 2017 : HTTP Method: POST, Resource Path: /find
Mon Feb 06 21:47:00 UTC 2017 : Method request path: {}
Mon Feb 06 21:47:00 UTC 2017 : Method request query string: {}
Mon Feb 06 21:47:00 UTC 2017 : Method request headers: {}
Mon Feb 06 21:47:00 UTC 2017 : Method request body before transformations: 
Mon Feb 06 21:47:00 UTC 2017 : Endpoint request URI: http://microservice-alb-xxxxxxx.us-east-1.elb.amazonaws.com/find
Mon Feb 06 21:47:00 UTC 2017 : Endpoint request headers: {x-amzn-apigateway-api-id=hw4gf0e5ui, Accept=application/json, User-Agent=AmazonAPIGateway_hxyf0t7ui, X-Amzn-Trace-Id=Root=1-456twed4-97d26555a0abcd123413ad35}
Mon Feb 06 21:47:00 UTC 2017 : Endpoint request body after transformations: 
Mon Feb 06 21:47:03 UTC 2017 : Execution failed due to an internal error
Mon Feb 06 21:47:03 UTC 2017 : Method completed with status: 504

A few times it works fine and gives the right response with status code 200 and few times it gives the above response. The same is the behavior when executing the Test in API Gateway as well as when the resource is deployed to a stage and is accessed through the stage.

I have turned the access logs for the application load balancer as well as enabled the cloud watch logs by overriding the stage settings in API gateway. But I do not get any detailed info on this error.

How can I troubleshoot why this error is thrown in API gateway?

Thanks,

Ranjith

Solution

The only time I've seen 504s with an ALB is when the ALB was deployed in front of a cluster with only one availability zone. ALB's require multiple AZs and you will get random timeouts when the ALB tries to search for routes in the other AZs.

If you rule out the ALB, then something may be going on in your API Gateway code. I'd simplify things there, do you have a custom validator? If so turn of caching of the credentials while debugging. It can also be easier to test with the new passthrough mappings if you are not using that already.