amazon-web-services amazon-elastic-beanstalk amazon-cloudfront deep-linking aws-application-load-balancer

CloudFront reports 5xx errors for apple-app-site-association (deep linking) even though the origin server is returning HTTP 200


When we were running on the now-retired 'Tomcat 8 with Java 8 running on 64bit Amazon Linux/3.3.8' platform behind a Classic Load Balancer, we never saw HTTP 5xx errors.

That platform has been retired, so we created a new environment, 'Tomcat 8.5 with Corretto 11 running on 64bit Amazon Linux 2/4.1.3', and during this transition we migrated from the AWS Classic Load Balancer to the newer Application Load Balancer.

Since then, everything has been running pretty smoothly, with two exceptions:

1 - Any URLs referencing myapp.com//rest/something failed (double-slash needed to be removed, I'm not sure why that was suddenly an issue - but it's resolved by a simple code tweak that only affected our UAT testing)

2 - I've noticed a bunch of HTTP 5xx errors being shown in the CloudFront portal. It's this that I'm focusing on in this question.

Table - Reports & Analytics > Popular Objects

You'll notice that there are 2xx responses too 🤔 so this rules out most common issues about SSL being configured incorrectly - I would have expected them all to fail, not 50% of them.

I'm consistently seeing a 2-4% error rate, and judging from the popular-objects table I'm assuming the errors are all deep-link related.

I have verified that the deep-link files return HTTP 200 when accessed via the browser (and curl). I've tried both through the CDN and directly against the load balancer using the AWS public Elastic Beanstalk URL.
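A check like this can be repeated for every hostname and path combination that a client might request. The following is only a minimal Python sketch: the helper names `candidate_urls` and `fetch_status` are made up for illustration, and the hostnames are the placeholders used in this post. Note that `urllib` follows redirects by default, so a 301 at the origin would be reported as the status of the final response.

```python
import urllib.error
import urllib.request
from itertools import product


def candidate_urls(hosts, paths):
    """Build every https URL for the given hostnames and paths."""
    return [f"https://{host}{path}" for host, path in product(hosts, paths)]


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET request to url.

    HTTP error responses (4xx/5xx) are returned as their status code;
    connection-level failures (DNS, TLS) still raise urllib.error.URLError.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Example usage (performs real network requests, so it is left commented out):
# hosts = ["mycompany.com", "www.mycompany.com"]  # placeholder hostnames
# paths = ["/apple-app-site-association",
#          "/.well-known/apple-app-site-association"]
# for url in candidate_urls(hosts, paths):
#     print(url, fetch_status(url))
```

Testing each hostname separately matters because a request that succeeds for one hostname says nothing about another hostname served by the same distribution.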

I have seen reports that misconfiguring SSL can cause these 502 errors. However, I have set up multiple behaviours for different URL paths, and they all use the same SSL certificates. In addition, you can see from the first screenshot that around 50% of the requests hit the cache and 4,300 are HTTP success 2xx.

I've invalidated the cache and after 5-10 minutes the rate doesn't get any worse, so I have to conclude that the CDN and origin are communicating fine, at least half of the time.

I have also seen reports that server-side redirects (HTTP 301) can cause an HTTP 5xx from CloudFront, but I've verified that the deep-link URLs (e.g. apple-app-site-association) serve a static HTML file, with no redirect filters getting in the way.

I've tried comparing the CloudFront log entries for HTTP 2xx and 5xx responses, but there's no obvious pattern that I can see to explain it. For example, I see both errors and successes with the same SSL protocols/ciphers (I don't know this area very well though!). Below is only a sample of a few entries in each HTTP response category.

502

  13: 2020-12-27    00:00:03    AMS54-C1    1304    [ip-redacted]   GET d2yrbvancsuyx.cloudfront.net    /apple-app-site-association 502 -   swcd%20(unknown%20version)%20CFNetwork/1126%20Darwin/19.5.0 -   -   Error   TC10VGvkak58IlwqwCXpG9_GiR3HZR5vaouaC3AhiU6U5vFKbItI5g==    mycompany.com   https   230 0.073   -   TLSv1.3 TLS_AES_128_GCM_SHA256  Error   HTTP/1.1    -   -   54519   0.073   OriginConnectError  text/html   951 -   -

   50: 2020-12-27   00:00:05    WAW50-C1    1304    [ip-redacted]   GET d2yrbvancsuyx.cloudfront.net    /.well-known/apple-app-site-association 502 -   swcd%20(unknown%20version)%20CFNetwork/976%20Darwin/18.2.0  -   -   Error   lCjWcds-6t1jOt1GI1mII-7DoPVKEE8mIxtT5sGZpWN7vj6t2gqBcQ==    mycompany.com   https   241 0.096   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error   HTTP/1.1    -   -   63785   0.095   OriginConnectError  text/html   951 -   -

   51: 2020-12-27   00:00:05    WAW50-C1    1312    [ip-redacted]   GET d2yrbvancsuyx.cloudfront.net    /apple-app-site-association 502 -   swcd%20(unknown%20version)%20CFNetwork/976%20Darwin/18.2.0  -   -   Error   g8Zj46gI3HMK3KJehze1u9WYMlxCl8dlIjc3vZFat-Jx3HmZD_I17w==    mycompany.com   https   229 0.050   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error   HTTP/1.1    -   -   63785   0.050   Error   text/html   951 -   -

200's


   23: 2020-12-27   00:00:08    LHR3-C2 598 [ip-redacted]   GET d2yrbvancsuyx.cloudfront.net    /.well-known/apple-app-site-association 200 -   swcd%20(unknown%20version)%20CFNetwork/1128.0.1%20Darwin/19.6.0 -   -   Hit tdbpQ0zxszX4y70H9vniecKe9HP3xwd_KeI5SjrlckgrKNgsTJJFdA==    www.mycompany.com   https   250 0.001   -   TLSv1.3 TLS_AES_128_GCM_SHA256  Hit HTTP/1.1    -   -   10372   0.001   Hit -   193 -   -

   45: 2020-12-27   00:00:11    AMS54-C1    599 [ip-redacted]   GET d2yrbvancsuyx.cloudfront.net    /.well-known/apple-app-site-association 200 -   swcd%20(unknown%20version)%20CFNetwork/1126%20Darwin/19.5.0 -   -   Hit QpbX2mGlhzXZR1gBC-HaZfBA-q5VWUC6t4NQgb6w3At4sCGhIz8ihQ==    www.mycompany.com   https   246 0.001   -   TLSv1.3 TLS_AES_128_GCM_SHA256  Hit HTTP/1.1    -   -   54526   0.001   Hit -   193 -   -

   53: 2020-12-27   00:00:07    WAW50-C1    599 [ip-redacted]   GET d2yrbvancsuyx.cloudfront.net    /.well-known/apple-app-site-association 200 -   swcd%20(unknown%20version)%20CFNetwork/976%20Darwin/18.2.0  -   -   Hit UjQtTqnrlVbupxZmXj8RxwwISCfXgJ8viMD38vvEYXdmO-UWcFjk3A==    www.mycompany.com   https   245 0.002   -   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Hit HTTP/1.1    -   -   63787   0.002   Hit -   193 -   -
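One way to look for a pattern in entries like the samples above is to tally them by status code and by the Host header the viewer sent. The Python sketch below assumes raw CloudFront standard log lines (without the leading `13:` style line-number prefix shown in the samples), where the status (`sc-status`) is the 9th field and the Host header (`x-host-header`) is the 16th. The function name `tally_status_by_host` is hypothetical.

```python
from collections import Counter

# Zero-based field positions in the CloudFront standard log format.
STATUS_FIELD = 8        # sc-status
HOST_HEADER_FIELD = 15  # x-host-header


def tally_status_by_host(lines):
    """Count log entries per (status, host header) pair.

    Blank lines, '#' header lines and lines with too few fields are skipped.
    Fields are split on any whitespace; empty values in these logs are
    written as '-', so no field is lost by doing that.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) <= HOST_HEADER_FIELD:
            continue
        counts[(fields[STATUS_FIELD], fields[HOST_HEADER_FIELD])] += 1
    return counts


# Example usage with a local, already-decompressed log file (path is an assumption):
# with open("cloudfront.log") as log_file:
#     for (status, host), count in sorted(tally_status_by_host(log_file).items()):
#         print(status, host, count)
```

Grouping on other columns (edge location, TLS protocol, result type) works the same way by changing the field index.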

When I launch our app the deep-link URLs are being processed properly; instead of displaying a browser the app is launched and the details are rendered as expected. I've even deleted the app and re-installed it from the app store and the deep-linking is registered as expected.

The Elastic Beanstalk environment has been set up with an Apache HTTP server (rather than NGINX). It's hosted in Europe/Ireland with an SSL certificate matching *.mycompany, www.mycompany and a few other sub-domains. I can access it directly using the Elastic Beanstalk public URL, where a certificate warning is given, but that's to be expected because the cert is for mycompany and not for mycompany.eu-west-1.elasticbeanstalk.com. Inspecting the certificate shows it's valid (not expired) and issued for the domain mycompany.com. I have since added it to my trust store to proceed to viewing the file, which is returned fine with HTTP 200 as expected.

The CloudFront CDN unfortunately doesn't have an option to reference the AWS EU/Ireland SSL certificates, so I've used AWS Certificate Manager (ACM) to generate SSL certificates in US/East (North Virginia).

Internally, CloudFront retrieves the data from the origin and is set up to use HTTP or HTTPS as appropriate; for HTTPS it accesses the origin using the EU/Ireland SSL certificate.

Like I said, this all works fine for all the other CDN behaviours, but for some reason 5xx are shown in the popular objects table (which I believe are all 502 errors) only for the deep-link files.

The application logs don't show any issues, but I presume the failing requests don't even reach the origin, hence the 5xx errors.

Does anyone know how I can resolve the 5xx errors through CloudFront --> Application Load Balancer --> Apache --> Our static HTML pages ?

To be clear, we didn't see this issue when we were using CloudFront --> CLASSIC Load Balancer.

The behaviours have all remained the same as before; all I did was add the new origin to the CloudFront distribution and then change each behaviour to reference the new origin.

FYI, I did notice what looks like a bug in the AWS console: editing the behaviours cleared out the whitelisted headers, so I had to reselect 'Host', otherwise the page had a validation error of 'To use SSL with an ELB origin, either forward all headers or whitelist the Host header. If you do not want to forward any headers, change the Origin Protocol Policy to HTTP Only.'


Solution

  • It looks like the issue is that the SSL certificate is wildcarded for *.mycompany.com, and accessing the site through mycompany.com (the apex domain directly) isn't covered by the wildcard.

    I suspect that I need to edit the ACM-provisioned certificate, or create a new one, so that it explicitly lists both mycompany.com and *.mycompany.com. I'm awaiting confirmation from AWS.

    It seems that Apple must be requesting the file from our server without the www prefix, and maybe once that fails it retries with the www prefix. I think this would explain why I see almost a 50% failure rate (fail first time, retry works = 50% success).

    UPDATE: I can confirm that since adding the top-level apex (mycompany.com) to the SSL certificate, in addition to the wildcarded domain (*.mycompany.com), the errors are no longer seen.
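To illustrate why the wildcard alone was not enough, here is a minimal Python sketch of the usual wildcard matching rule, where `*` stands for exactly one leftmost DNS label. It is a simplification for illustration, not the matching code CloudFront or any TLS library actually runs, and the function names `san_matches` and `uncovered_hosts` are made up.

```python
def san_matches(pattern, hostname):
    """Return True if a certificate name (possibly '*.example.com') covers hostname.

    Simplified rule: a leading '*.' matches exactly one non-empty label,
    so '*.mycompany.com' covers 'www.mycompany.com' but neither the apex
    'mycompany.com' nor 'a.b.mycompany.com'.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".mycompany.com"
    if not hostname.endswith(suffix):
        return False
    leftmost = hostname[: -len(suffix)]
    return leftmost != "" and "." not in leftmost


def uncovered_hosts(certificate_names, hostnames):
    """List the hostnames that none of the certificate names cover."""
    return [
        host
        for host in hostnames
        if not any(san_matches(name, host) for name in certificate_names)
    ]


# Wildcard only: the apex domain is left uncovered.
print(uncovered_hosts(["*.mycompany.com"], ["mycompany.com", "www.mycompany.com"]))
# Wildcard plus apex: both hostnames are covered.
print(uncovered_hosts(["*.mycompany.com", "mycompany.com"],
                      ["mycompany.com", "www.mycompany.com"]))
```

The first call reports `mycompany.com` as uncovered, while the second returns an empty list, which mirrors the fix described in the update above.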