My Spring Boot REST endpoint returns HTTP status immediately but sometimes waits 60 seconds before returning the result.
This is caused by the headers of a Feign-proxied response being altered in transit (Transfer-Encoding being replaced with Content-Length).
The context and results of the investigation are detailed below.
- B answers with a Transfer-Encoding: chunked header, but Traefik replaces it with Content-Length depending on unknown criteria.
- Removing the Content-Length header solves the issue. Debugging sun.net.www.http.HttpClient confirms that having a Content-Length header triggers the use of a KeepAlive stream.
- Passing a Connection: close header also seems to solve the issue (same reason: this prevents the use of the KeepAliveStream).
- One of our latest tests was to use -v with curl, and we saw this while A was pausing:
$ curl -i -v http://localhost:8100/a/proxied-endpoint
#...
< Content-Length: 1843
<
{ [1793 bytes data]
So, the service hangs and waits for 50 missing bytes. When interrupted, though, it returns the whole response. I'm thinking of an encoding issue there, but don't understand where it could happen.
Replacing the content length with 1793 (or lower) makes the endpoint return immediately.
There is a discrepancy between the way the Content-Length header is computed and the number of bytes our client actually receives in the body.
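The hang can be reproduced outside of our stack with a plain HttpURLConnection against a server that announces more bytes than it sends. Below is a minimal, self-contained sketch (not our actual services; the port and sizes simply mirror the curl example above):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ContentLengthMismatchDemo {

    public static void main(String[] args) throws Exception {
        // Fake server: announces Content-Length: 1843 but only writes 1793 bytes,
        // then keeps the connection open (keep-alive), like in the curl trace.
        ServerSocket server = new ServerSocket(8100);
        new Thread(() -> {
            try (Socket socket = server.accept()) {
                OutputStream out = socket.getOutputStream();
                out.write(("HTTP/1.1 200 OK\r\n"
                        + "Content-Type: application/json\r\n"
                        + "Content-Length: 1843\r\n"
                        + "\r\n").getBytes(StandardCharsets.US_ASCII));
                out.write(new byte[1793]); // 50 bytes short of the announced length
                out.flush();
                Thread.sleep(60_000);      // connection stays open, so the client sees no EOF
            } catch (Exception ignored) {
            }
        }).start();

        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:8100/").openConnection();
        // Blocks here, waiting for the 50 missing bytes, until the server closes the socket.
        byte[] body = conn.getInputStream().readAllBytes();
        System.out.println("Read " + body.length + " bytes");
    }
}
```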
I'm having an issue with a Spring Boot REST controller (A) that acts as a proxy to another service (B). The endpoint basically looks like this:
@GetMapping("/{resource}")
public ResponseEntity<List<Something>> getAll(@PathVariable resource) {
return myFeignProxy.getAll(resource);
}
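myFeignProxy is a plain Spring Cloud OpenFeign client; it looks roughly like this (the names and URL property are illustrative, not our actual code):

```java
import java.util.List;
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Illustrative declaration of the Feign client used by the controller above.
@FeignClient(name = "service-b", url = "${service-b.url}")
public interface MyFeignProxy {

    @GetMapping("/{resource}")
    ResponseEntity<List<Something>> getAll(@PathVariable("resource") String resource);
}
```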
There's a Traefik reverse proxy between A and B. In summary: A -> Traefik -> B.
In all cases, the Feign proxy answers in less than 100 ms and the endpoint returns the HTTP status (200) immediately. However, in most cases, the body is not returned immediately: A's Spring Boot waits for precisely 60 seconds (this is really not random).
Whether the body is returned immediately or after 60 s seems to depend on the resource: some resources are always available immediately, the others always have the wait. Once again, this does not seem random.
[Edit]: Investigation has shown that, in the cases where A pauses, Traefik replaced B's original Transfer-Encoding header with Content-Length.
Based on this header, sun.net.www.http.HttpClient decides to use a KeepAliveStream. The problem is that this stream is then never closed.
- Spring Boot: 2.2.6
- Tomcat: 9.0.33
- Feign: (determined by Spring Cloud 2.1.2)
- Traefik: 2.2.11
It is not an issue with the proxied service (B) being slow. In all cases, myFeignProxy responds in a few ms and the endpoint returns the 200 HTTP status immediately.
I've tried changing Feign client timeouts, without any impact.
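For reference, the timeouts were overridden roughly like this, using Feign's Request.Options (a sketch; the values are illustrative). It made no difference:

```java
import feign.Request;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch: overriding Feign's connect/read timeouts (values are illustrative).
@Configuration
public class FeignTimeoutConfig {

    @Bean
    public Request.Options feignRequestOptions() {
        // connectTimeoutMillis, readTimeoutMillis
        return new Request.Options(2_000, 10_000);
    }
}
```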
I also see no correlation between the pause, the size of the body and the Feign proxy's response time.
| Resource | Size (KB) | Feign time (ms) | 60 s pause |
|---|---|---|---|
| 1 | 1.87 | 34 | yes |
| 2 | 3.29 | 35 | no |
| 3 | 1.55 | 34 | yes |
| 4 | 10.05 | 81 | yes |
The problem is not related to Spring Security either, as entirely removing it (configuration and dependencies) does not change the symptoms.
The pause seems to come from Tomcat. Replacing the Tomcat starter with the Jetty starter (in A) eliminates the issue (all requests answer immediately). That being said, it doesn't explain the problem.
It appears that, for an endpoint where the pause occurs, there is one additional line in the logs during the call. See below for examples. The parameters of the HttpURLConnection also seem to differ, though I do not understand why.
Case without pause
DEBUG [nio-8100-exec-9] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@784b4a945 pairs: {GET /xxx HTTP/1.1: null}{Accept: application/json}{User-Agent: Java/11.0.7}{Host: xxx}{Connection: keep-alive}
DEBUG [nio-8100-exec-9] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@2a3818a612 pairs: {null: HTTP/1.1 200 OK}{Cache-Control: no-cache, no-store, max-age=0, must-revalidate}{Content-Type: application/json}{Date: Tue, 20 Apr 2021 07:47:47 GMT}{Expires: 0}{Pragma: no-cache}{Strict-Transport-Security: max-age=31536000 ; includeSubDomains}{Vary: accept-encoding}{X-Content-Type-Options: nosniff}{X-Frame-Options: DENY}{X-Xss-Protection: 1; mode=block}{Transfer-Encoding: chunked}
Case with pause
DEBUG [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@7bff99e75 pairs: {GET /xxx HTTP/1.1: null}{Accept: application/json}{User-Agent: Java/11.0.7}{Host: xxx}{Connection: keep-alive}
TRACE [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection : KeepAlive stream used: https://xxx/xxx
DEBUG [nio-8100-exec-6] s.n.www.protocol.http.HttpURLConnection : sun.net.www.MessageHeader@5aed6c9312 pairs: {null: HTTP/1.1 200 OK}{Cache-Control: no-cache, no-store, max-age=0, must-revalidate}{Content-Type: application/json}{Date: Tue, 20 Apr 2021 07:57:42 GMT}{Expires: 0}{Pragma: no-cache}{Strict-Transport-Security: max-age=31536000 ; includeSubDomains}{Vary: accept-encoding}{X-Content-Type-Options: nosniff}{X-Frame-Options: DENY}{X-Xss-Protection: 1; mode=block}{Content-Length: 803}
When finally responding after the pause (not present when responding immediately)
DEBUG [nio-8100-exec-7] o.apache.tomcat.util.threads.LimitLatch : Counting down[http-nio-8100-exec-7] latch=1
DEBUG [nio-8100-exec-7] org.apache.tomcat.util.net.NioEndpoint : Calling [org.apache.tomcat.util.net.NioEndpoint@63668501].closeSocket([org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@cfdc708:org.apache.tomcat.util.net.NioChannel@6e7c15b6:java.nio.channels.SocketChannel[connected local=/0:0:0:0:0:0:0:1:8100 remote=/0:0:0:0:0:0:0:1:52501]])
The additional log ("KeepAlive stream used") comes from sun.net.www.http.HttpClient. As far as I understand, the decision to use this KeepAlive stream depends on the proxied response's headers.
Traefik changes the headers between A and B.
B always returns its response with Transfer-Encoding: chunked. Traefik sometimes replaces it with Content-Length and the correct size for the payload.
Nothing is configured in our Traefik instance concerning headers.
The rules used to decide between Transfer-Encoding and Content-Length seem hard to grasp. This explains why the problem is not reproducible when both applications are on the local machine, since there is no Traefik between them.
About the rules that Traefik applies, it appears the HTTP version plays a role.
$ curl -s -o /dev/null -D - --http1.1 https://traefik/b/endpoint
HTTP/1.1 200 OK
# ...
Transfer-Encoding: chunked
$ curl -s -o /dev/null -D - --http2 https://traefik/b/endpoint
HTTP/2 200
# ...
content-length: 2875
Traefik always returns the same headers for a given endpoint, so we can assume that the headers also depend on the address or, more likely, on the payload (a given endpoint always returns the same payload for this service).
Performing a git bisect, I discovered that the 60-second pause appeared when the proxied service (B) started using ZonedDateTime instead of LocalDateTime in its DTOs. The only change is that date-time fields now have an offset in the response body; there is no impact on the headers. Yet the Feign client works fine for LocalDateTimes and pauses for ZonedDateTimes.
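The size difference can be illustrated with a Jackson round trip, assuming the default JavaTimeModule behaviour (deserialization adjusts ZonedDateTimes to the context time zone, UTC); the exact date format used by our DTOs may differ:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import java.time.ZonedDateTime;

public class ZonedDateTimeShrinkDemo {

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
                .registerModule(new JavaTimeModule())
                .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);

        // What B sends (offset form) vs what A re-serializes (adjusted to UTC).
        String fromB = "\"2021-12-24T23:59:57+01:00\"";
        ZonedDateTime parsed = mapper.readValue(fromB, ZonedDateTime.class);
        String fromA = mapper.writeValueAsString(parsed); // "2021-12-24T22:59:57Z"

        System.out.println(fromB.length() + " -> " + fromA.length()); // 27 -> 22: 5 bytes lost per date
    }
}
```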
Passing the Connection: close header makes the pause disappear in A. The response body is returned immediately, and HttpClient does not use the KeepAliveStream in this case.
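One way to pass that header on every Feign call is a standard Feign RequestInterceptor; this is a sketch of that approach, not necessarily how we wired it in our tests:

```java
import feign.RequestInterceptor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch: add Connection: close to every Feign request so HttpURLConnection
// does not take the keep-alive path. Disabling keep-alive globally with the
// JDK system property http.keepAlive=false should have a similar effect.
@Configuration
public class FeignConnectionCloseConfig {

    @Bean
    public RequestInterceptor connectionCloseInterceptor() {
        return template -> template.header("Connection", "close");
    }
}
```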
I wrote a quick mock service B. It returns the Content-Type header and the content. What's interesting is that:

- if mock-B includes a Content-Length header, then A has a 60-second pause;
- if mock-B does not include the Content-Length header, then A returns immediately.

This is consistent with the previous tests indicating that the Content-Length header plays a role, but I am still unable to understand what role exactly, since the header is also present in some responses coming through Traefik for which A still returns immediately.
mock-B
const port = 8080;
const http = require('http');
const path = require('path');
const fs = require('fs');

const countBytesInString = str => encodeURI(str).split(/%..|./).length - 1;

const requestListener = (req, res) => {
  console.log(`\n>> Mock call to endpoint ${req.url}`);
  fs.readFile(path.join(__dirname, `endpoints${req.url}.json`), 'utf8', (err, data) => {
    if (err) {
      console.error(err);
      return;
    }
    const contentLength = countBytesInString(data);
    console.log(`Content-Length: ${contentLength}`);
    res.setHeader('Content-Type', 'application/json');
    res.setHeader('content-length', contentLength);
    res.writeHead(200);
    res.end(data);
  });
};

const server = http.createServer(requestListener);
server.listen(port);
console.log(`Mock server listening on port ${port}`);
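To run the test, the payload to serve is dropped in endpoints/<resource>.json next to the script, and A's Feign client is pointed at http://localhost:8080.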
We've finally understood the mechanism that leads to the issue.
A -> Traefik -> B

1. B answers with ZonedDateTimes serialized with their offset (e.g. "validFrom":"2021-12-24 23:59:57+01:00") and the header Transfer-Encoding: chunked.
2. Traefik replaces Transfer-Encoding: chunked with a Content-Length, computed from the body of the response.
3. A re-serializes the ZonedDateTimes in UTC (e.g. "validFrom":"2021-12-24 22:59:57Z"), but it reuses the Content-Length received from Traefik without recalculating it.

As a consequence, the body sent by A is shorter than the announced Content-Length (each ZonedDateTime takes five bytes less when A sends it than when Traefik computed the content length).
The client, however, has been promised the announced Content-Length and keeps waiting for the missing bytes. That matches the curl trace above: 1843 announced minus 1793 received leaves 50 missing bytes, which would correspond to ten ZonedDateTime fields at 5 bytes each.
The solution we have in mind right now is to tell Feign and its calling controller to return a ResponseEntity<String> instead of a ResponseEntity<List<MyObject>>, so that A passes the body through without re-serializing it.
Pros: A no longer re-serializes the body, so what it sends matches the Content-Length it forwards.
Cons: the controller loses its typed signature; the payload is no longer deserialized into List<MyObject>.
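Concretely, the change would look roughly like this (names are illustrative):

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Sketch of the workaround: both the Feign client and the controller handle the
// body as an opaque String, so A forwards exactly the bytes it received.
@FeignClient(name = "service-b", url = "${service-b.url}")
interface MyFeignProxy {

    @GetMapping("/{resource}")
    ResponseEntity<String> getAll(@PathVariable("resource") String resource);
}

@RestController
class ProxyController {

    private final MyFeignProxy myFeignProxy;

    ProxyController(MyFeignProxy myFeignProxy) {
        this.myFeignProxy = myFeignProxy;
    }

    @GetMapping("/{resource}")
    public ResponseEntity<String> getAll(@PathVariable("resource") String resource) {
        return myFeignProxy.getAll(resource);
    }
}
```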