node.js graphql google-kubernetes-engine apollo apollo-federation

Inter-service communication between Apollo Federation subgraphs

Let's say we have two subgraphs, S1 and S2, and a gateway, G.

The S1 subgraph service needs some data from the S2 service. How should this be handled at the gateway and schema level? Should the gateway be involved in this kind of communication at all?

Should we have a separate schema and Apollo Server inside every subgraph that contains the internal queries and mutations? Should S1 call S2's "internal Apollo Server" directly?

By default, all user-facing requests need to be authorized by JWT, but internal communications should work without this.

Subgraphs are not available on the public network, but they're running on the same internal network. Technically they can see each other. They're hosted on GKE.

Solution

This is actually one of my favorite topics! To start off, as usual, the answer is "it depends". With that out of the way, let's get into some details...

The main variable to look at is how fast your system needs to scale in terms of how many subgraphs you need. Here are a few solutions with varying levels of complexity and effort to implement.

Calling the Gateway

For a few services, like your example of just 2, you can easily get away with calling the gateway from either service. This has the advantage of being easy to implement, and can be a decent way to use federation for a smaller company with only 10 or so subgraphs, or data that is generally sparse. This method has significant drawbacks though:

Every hop is reauthorized and adds latency from the extra round trip. In very large microservice environments (think Google, Facebook, Netflix, etc.) this can become problematic because some features could have a vertical call stack of 10-20 services, and it all needs to resolve in, let's say, 200 milliseconds to meet SLO requirements. It would be a deal breaker if each hop required a round trip to the gateway, including authorization and encryption. Many services don't even do inter-service encryption if the feature raises no user-privacy concerns.
You need to be careful with circular dependencies. The gateway will not know that an internal service is making the call and will just complete the request as usual. This sounds easy to manage if you're a small team, but as things grow you could end up with a dependency chain of S1 -> S2 -> S3 -> S4 -> S5 -> S1 without even realizing it, and it could creep up in small edge cases and be difficult to debug. Circular dependencies are always a potential problem in large systems, but going through the gateway makes them harder to catch without a strong understanding of the entire call tree of each query.
You will be limited to the same view of the fields that the client has. With this method, a service has no way to expose additional information internally that could be useful for performing certain actions. This could be something like stock inventory that you may not want users to be able to query.
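One way to make the circular dependency risk visible is to keep a declared map of which service calls which, and check it for cycles in CI. The following is only a sketch of that idea using a depth-first search; the `DependencyMap` shape and the `findCycle` helper are hypothetical and not part of Apollo Federation.

```typescript
// Hypothetical map: each service lists the services it calls directly.
type DependencyMap = Record<string, string[]>;

// Depth-first search that returns the first cycle found, or null if none.
function findCycle(deps: DependencyMap): string[] | null {
  const done = new Set<string>(); // fully explored, known to be cycle-free
  const path: string[] = [];      // services on the current DFS path

  const visit = (service: string): string[] | null => {
    const seenAt = path.indexOf(service);
    if (seenAt !== -1) {
      // We came back to a service already on the path: that is a cycle.
      return [...path.slice(seenAt), service];
    }
    if (done.has(service)) return null;

    path.push(service);
    for (const next of deps[service] ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    path.pop();
    done.add(service);
    return null;
  };

  for (const service of Object.keys(deps)) {
    const cycle = visit(service);
    if (cycle) return cycle;
  }
  return null;
}

// The chain from the example above.
const deps: DependencyMap = {
  S1: ["S2"], S2: ["S3"], S3: ["S4"], S4: ["S5"], S5: ["S1"],
};
console.log(findCycle(deps)?.join(" -> ") ?? "no cycles");
// S1 -> S2 -> S3 -> S4 -> S5 -> S1
```

This only catches what the map declares, so it depends on teams keeping the map up to date.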

It's important to note that most of these issues only start to become deal breakers at around a medium scale, so it's worth evaluating your requirements before deciding against this method.
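For reference, a call from S1 to the gateway G could look roughly like the sketch below. The gateway URL, the `HttpPost` type, and the function names are assumptions made for illustration. The point is that the caller's JWT has to be forwarded, because the gateway authorizes this request like any other.

```typescript
// Body and headers of a GraphQL-over-HTTP POST request.
type PostInit = {
  method: string;
  headers: Record<string, string>;
  body: string;
};

// Minimal shape of an HTTP client, e.g. the global fetch in Node 18+.
type HttpPost = (
  url: string,
  init: PostInit,
) => Promise<{ json(): Promise<unknown> }>;

// Hypothetical internal address of the gateway G.
const GATEWAY_URL = "http://gateway.internal:4000/graphql";

// Builds the request S1 would send to the gateway.
function buildGatewayRequest(
  query: string,
  variables: Record<string, unknown>,
  jwt: string,
): PostInit {
  return {
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Forward the user's token: the gateway re-checks it on every hop.
      authorization: `Bearer ${jwt}`,
    },
    body: JSON.stringify({ query, variables }),
  };
}

// Sends the query through the supplied HTTP client and returns the JSON body.
async function queryGateway(
  post: HttpPost,
  query: string,
  variables: Record<string, unknown>,
  jwt: string,
): Promise<unknown> {
  const response = await post(
    GATEWAY_URL,
    buildGatewayRequest(query, variables, jwt),
  );
  return response.json();
}
```

In a resolver you would pass the global `fetch` as `post` and take the token from the incoming request's context.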

Calling the Subgraph Directly

If the subgraphs are designed with proper concern-based separation, then the information that you need should be available from the subgraph itself, without needing the gateway to resolve additional fields. This removes the latency cost of going through the gateway and makes managing a dependency graph easier. It does not solve the view fields concern discussed above. This is what I personally do, but it seems to be a bit of a controversial pattern.

Note: you mentioned that everything is running in GKE. You could use cluster-local DNS to access the services, but I highly recommend looking into service meshes to solve the issue of routing between the subgraphs. Istio, NGINX Service Mesh, and Linkerd are a few options to consider.
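If you go with plain cluster-local DNS, a Kubernetes Service is reachable at `<service>.<namespace>.svc.cluster.local`. A small helper like the one below keeps those URLs in one place. The Service name, namespace, and port are made up for the example, and the helper assumes the default cluster domain `cluster.local`.

```typescript
// Builds the cluster-local URL for a Kubernetes Service.
// Assumes the default cluster domain "cluster.local".
function subgraphUrl(
  service: string,
  namespace: string,
  port: number,
  path = "/graphql",
): string {
  return `http://${service}.${namespace}.svc.cluster.local:${port}${path}`;
}

// Hypothetical Service name, namespace, and port for the S2 subgraph.
console.log(subgraphUrl("s2-subgraph", "default", 4001));
// http://s2-subgraph.default.svc.cluster.local:4001/graphql
```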

Using an Internal Backplane

As a system continues to grow, it becomes more and more important to treat the supergraph as the backend for the product. The internal design of the product likely won't use only GraphQL. Large companies will even develop entire stand-alone internal platforms if that solves a problem they run into regularly. In that case these systems likely do not need to be added to the supergraph. This is close to your example of using an "internal Apollo Server", but I honestly would recommend just using REST or gRPC. To keep to the spirit of microservice architecture, you could even have the main service use gRPC, and the subgraph registered with the gateway would just be a view layer for the gRPC server.
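To illustrate the "view layer" idea, here is a rough sketch of a subgraph resolver map that only delegates to an internal client. The `InventoryClient` interface, the `Product.inStock` field, and the in-memory fake are all hypothetical. In a real setup the client would be a generated gRPC stub or a REST wrapper, and other services would call it directly instead of going through GraphQL.

```typescript
// Hypothetical internal API of the main service. In practice this would be
// a generated gRPC client or a thin REST wrapper.
interface InventoryClient {
  getStock(productId: string): Promise<{ productId: string; quantity: number }>;
}

// Resolver map for the subgraph. It holds no business logic; it only
// translates GraphQL fields into calls on the internal client.
function makeResolvers(client: InventoryClient) {
  return {
    Product: {
      inStock: async (product: { id: string }): Promise<boolean> => {
        const stock = await client.getStock(product.id);
        // Users only see a boolean; the exact quantity stays internal.
        return stock.quantity > 0;
      },
    },
  };
}

// In-memory stand-in for the real client, used here only for illustration.
const fakeClient: InventoryClient = {
  getStock: async (productId) => ({
    productId,
    quantity: productId === "p1" ? 3 : 0,
  }),
};

const resolvers = makeResolvers(fakeClient);
resolvers.Product.inStock({ id: "p1" }).then((inStock) => console.log(inStock));
// true
```

This also addresses the view fields concern: internal callers get the full quantity from the client, while the supergraph exposes only what users should see.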