Tags: amazon-web-services, architecture, amazon-ecs, aws-fargate

Dynamic Stage Routing / Multi-Cluster Setup with Fargate


I have a Fargate cluster with a service running two containers:

  1. an nginx container that terminates mTLS (accepting a defined list of CAs) and forwards calls to the app container together with the DN of the client certificate
  2. a Spring app running on Tomcat that performs fine-grained authorization checks (per route and HTTP method) based on the incoming DN via a filter (sketched below)
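
For illustration, here is a minimal sketch of the kind of per-route, per-method check such a filter performs. It is written in Python rather than as the actual Spring filter, and the `X-Client-DN` header name and the rule table are assumptions, not the real configuration:

```python
# Minimal sketch of a DN-based authorization check (illustrative only).
# Assumes nginx forwards the client certificate DN in an "X-Client-DN" header;
# the real service implements this as a filter in the Spring app.

# Hypothetical rule table: (HTTP method, route prefix) -> set of allowed DNs.
RULES = {
    ("GET", "/orders"): {"CN=reporting-client,O=Example Corp"},
    ("POST", "/orders"): {"CN=ordering-client,O=Example Corp"},
}


def is_authorized(method: str, path: str, headers: dict) -> bool:
    """Return True if the forwarded DN may call this method/route."""
    dn = headers.get("X-Client-DN")
    if dn is None:
        return False  # no DN forwarded -> reject
    for (rule_method, prefix), allowed_dns in RULES.items():
        if method == rule_method and path.startswith(prefix):
            return dn in allowed_dns
    return False  # default deny


if __name__ == "__main__":
    dn = "CN=reporting-client,O=Example Corp"
    print(is_authorized("GET", "/orders/42", {"X-Client-DN": dn}))  # True
    print(is_authorized("POST", "/orders", {"X-Client-DN": dn}))    # False
```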

The nginx endpoints are exposed to the internet via a NAT gateway.

The infrastructure is managed via Terraform, and a new version is rolled out by replacing the task definition so that it points to the new images in ECR. ECS then starts the new containers and switches the DNS to them within 5 to 10 minutes.
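
In script form, this kind of rollout corresponds roughly to the following boto3 sketch (cluster, service, role, and image names are placeholders; the real setup drives this through Terraform):

```python
import boto3

# Sketch of the current rollout with placeholder names; the real deployment
# is driven by Terraform, not by this script.
ecs = boto3.client("ecs", region_name="eu-central-1")

# Register a new task definition revision that points at the new ECR images.
new_task_def_arn = ecs.register_task_definition(
    family="my-service",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {"name": "nginx", "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/nginx-mtls:v2", "essential": True},
        {"name": "app", "image": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/spring-app:v2", "essential": True},
    ],
)["taskDefinition"]["taskDefinitionArn"]

# Point the service at the new revision; ECS replaces the running tasks.
ecs.update_service(cluster="my-cluster", service="my-service", taskDefinition=new_task_def_arn)

# Block until ECS reports the service as stable (the 5-10 minute window).
ecs.get_waiter("services_stable").wait(cluster="my-cluster", services=["my-service"])
```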

Problems with this setup:

  1. I can't do canary or blue/green deployments.
  2. If the new app version has issues (the app is not able to start, there are huge error spikes, ...), the rollback takes a lot of time.
  3. I can't test my service in an integrated way without rolling out a new version and therefore probably breaking everything.

What I'm aiming for is a concept with multiple clusters and routing based on a specific header, so that I can spin up a new cluster with the new app version and traffic will not be routed to it until I either a) send a specific header or b) switch over completely, for example via a specific SSM parameter.

Basically the same thing you can do easily on CloudFront with Lambda@Edge for static frontend deployments (using multiple origin buckets and switching the origin with Lambda based on the incoming request).
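
For reference, that Lambda@Edge pattern looks roughly like the following origin-request handler (a minimal sketch; the `x-canary` header name and the bucket domain are placeholders):

```python
# Minimal sketch of a Lambda@Edge origin-request handler that switches the S3
# origin based on an incoming header. Header name and bucket domain are
# placeholders, not the real configuration.

CANARY_BUCKET = "my-frontend-canary.s3.eu-central-1.amazonaws.com"  # hypothetical


def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    # CloudFront lower-cases header names in the event structure.
    canary = headers.get("x-canary", [{}])[0].get("value") == "true"

    if canary and "s3" in request.get("origin", {}):
        # Point the request at the canary bucket; keep the Host header in sync
        # (and, if the buckets live in different regions, the region field too).
        request["origin"]["s3"]["domainName"] = CANARY_BUCKET
        headers["host"] = [{"key": "Host", "value": CANARY_BUCKET}]

    return request
```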

As I have the requirement for mTLS and those fine-grained authorizations, I can use neither a standard ALB nor API Gateway.

Are there any other smart solutions for my requirements?


Solution

  • To finally solve this, we went on to replicate the task definitions (xxx-blue and xxx-green) and ELBs and to create two different weighted A records. The deployment process (a scripted sketch follows below):

    1. Find out which task definition is inactive by checking the weights of both records (one will have a weight of 0%).
    2. Replace the inactive task definition so that it points to the new images in ECR.
    3. Wait for the apps to become healthy.
    4. Switch the traffic via the records to the ELB of the replaced task definition.
    5. Run integration tests and verify that there are no log anomalies.
    6. (Manually triggered) Set the desired task count of the old service to zero to scale the old version down. Until then, if there is unexpected behaviour, the records can be used to switch the traffic back to the ELB of the old version.

    What we didn't achieve with this: having client-based routing to different tasks.
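
As a rough illustration of this process, the switch can be scripted roughly as follows with boto3. Hosted zone ID, record name, cluster, service names, and the desired count are placeholders, and the real pipeline drives this through Terraform and CI rather than a single script:

```python
import boto3

# Placeholder names for illustration; the real values come from the Terraform setup.
HOSTED_ZONE_ID = "Z0000000000000"
RECORD_NAME = "api.example.com."
SERVICES = {"blue": "xxx-blue", "green": "xxx-green"}
CLUSTER = "my-cluster"

route53 = boto3.client("route53")
ecs = boto3.client("ecs", region_name="eu-central-1")


def weighted_records():
    """Fetch the two weighted records (blue/green) for the service domain."""
    records = route53.list_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        StartRecordName=RECORD_NAME,
        StartRecordType="A",
    )["ResourceRecordSets"]
    return [r for r in records if r["Name"] == RECORD_NAME and "Weight" in r]


def inactive_color():
    """Step 1: the record with weight 0 points at the inactive stack."""
    for record in weighted_records():
        if record["Weight"] == 0:
            return record["SetIdentifier"]  # assumed to be "blue" or "green"
    raise RuntimeError("no inactive record found")


def deploy_new_version(task_definition_arn):
    color = inactive_color()

    # Step 2: point the inactive service at the new task definition revision and
    # scale it back up (it may have been scaled to zero after the last deployment).
    ecs.update_service(
        cluster=CLUSTER,
        service=SERVICES[color],
        taskDefinition=task_definition_arn,
        desiredCount=2,  # placeholder
    )

    # Step 3: wait for the new tasks to become healthy/stable.
    ecs.get_waiter("services_stable").wait(cluster=CLUSTER, services=[SERVICES[color]])

    # Step 4: flip the weights so traffic goes to the ELB of the new version.
    changes = []
    for record in weighted_records():
        record["Weight"] = 100 if record["SetIdentifier"] == color else 0
        changes.append({"Action": "UPSERT", "ResourceRecordSet": record})
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID, ChangeBatch={"Changes": changes}
    )
```

Steps 5 and 6 (integration tests and scaling the old version down) stay manual in this sketch; scaling down would be another update_service call with desiredCount=0, and until it runs the weights can simply be flipped back.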