amazon-web-services amazon-ec2 amazon-elastic-beanstalk

How do I debug instances that fail to deploy with Elastic Beanstalk?

I just started having issues deploying a new version of my .NET Core application to AWS EB. AWS attempts to deploy it, but the instance does not respond and times out after about 15 minutes. The error message is:

WARN The following instances have not responded in the allowed
command timeout time (they might still finish eventually on their own): [instanceID]

The operation is then aborted.

It is a relatively small app. Doesn't need a ton of resources. Doesn't take long to startup typically.

I never get any logs from these instances in Elastic Beanstalk or CloudWatch. I'm not sure how to debug them. Where would I find logs about their failures after they are terminated?

Things I Tried:

I tried rebuilding my elastic beanstalk environment and that failed at first as well. I then raised it from a t2 micro instance to a t2 small and it came online. But then I tried deploying an older working application version and it failed to deploy that.
I checked AWS to see if they were reporting service issues but it said everything was good
I tried changing the deployment type (all at once, immutable, rolling).
I checked various security group stuff. All seemed normal.
I spun up a new environment with the sample .NET Core app and all the default config and got the same problem! Then I changed it to a t2-small and it ran. Haven't tried changing the application version yet.

What can I do to debug this? I see the terminated instances in EC2. Can I spin them up and SSH into them to inspect them in some way?

Edit:

I disabled rolling updates in my eb environment so that I could SSH into the working instance and watch what happens when I tell it to deploy another application version. It never even uploads the app. The eb-engine.log has no new logs. However the cfn-hup.log has the following:

2023-02-27 06:13:23,281 [INFO] command processing is alive.
2023-02-27 06:14:23,282 [WARNING] Timeout of 60 seconds breached
2023-02-27 06:14:23,282 [ERROR] Client-side timeout
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 193, in _retry
    return f(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 267, in _timeout
    "Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError
2023-02-27 06:15:24,249 [WARNING] Timeout of 60 seconds breached
2023-02-27 06:15:24,249 [ERROR] Client-side timeout
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 193, in _retry
    return f(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/util.py", line 267, in _timeout
    "Execution did not succeed after %s seconds" % duration)
cfnbootstrap.util.TimeoutError

Edit 2:

I found the source of the issue, but I don't understand it so would rather someone else answer with some clarity on the matter:

I had a VPC Endpoint for SQS configured to allow some lambdas to communicate with SQS. Something about the Endpoint was causing CloudFormation's cfn-hup service to timeout. From what I understand, this is the service responsible for reacting to configuration changes on each instance.

Once I deleted the Endpoint, it worked. But now I fear my lambda's won't work. I still need to test them though. The VPC has a gateway to the internet so I'm not sure why the Endpoint was required for them in the first place.

Solution

Because Elastic Beanstalk failed to provide any logs to CloudWatch and logs could not be requested successfully, this means we need to log into an instance to retrieve logs manually.

We do this with an EC2 Key Pair. It's easy to generate this from the AWS Console. Then apply this Key Pair to your instances via the elastic beanstalk configuration views of your app. EB usually applies a security group change when you do this so you can connect via SSH. However you may have to do it manually depending on how custom your configuration is.

Now log into the instance via an SSH client of your choice. The logs can be found under /var/log.

In this case, the only indicator of an issue was in the /var/log/cfn-hup.log file. This is a service responsible for listening to CloudFormation config changes on the instance. The errors indicated it was timing out trying to retrieve events from SQS which CloudFormation (and thereby EB) uses to implement changes to instances of a stack. Without this working, the instance will not even attempt to apply any changes and very few eb operations will succeed.

The cause of this was a VPC Endpoint for SQS that was created in the same VPC. It's unclear to me why this was a problem, but removing the Endpoint made everything work as expected.