Horizontal scaling and GraphQL within a Node.js environment

I am trying to build an application that includes an instant messaging module, and one of the main challenges is keeping it scalable regardless of the number of users or messages exchanged.

In an article I read that it is possible to build real-time applications using GraphQL with “subscriptions”. In addition, it is a simple-to-use protocol and has the advantage of minimizing the round trips needed to retrieve objects, and hence uses fewer resources.

But what if we need to add a new server/node to the system in order to scale horizontally? Is this possible using GraphQL?

As an example of a WebSocket implementation that allows horizontal scaling, there is SocketCluster. I wonder whether an application built with GraphQL alone can scale across multiple nodes/machines, or whether it must be combined with another framework like SocketCluster to achieve this.

Solution

In short: yes. We have done it, and it works pretty well.

The trick is that, when it comes to horizontal scaling, you have to think beyond just the API worker applications. If you want a push architecture, it needs to be asynchronous from the very beginning.

To achieve it, we used a queueing system, namely RabbitMQ.

Imagine this scenario of report generation, which can take up to 10 minutes:

1. Client connects to our GraphQL API (instance 1) via WebSocket.
2. Client sends a command to generate a report via WebSocket.
3. API generates a token for the command, puts the command to generate a report on the CommandQueue (in RabbitMQ), and returns the token to Client.
4. Client subscribes to events for its command result, using the token.
5. Some backend Worker picks up the command and executes the report generation procedure.
6. During this time, GraphQL API (instance 1) dies.
7. Client automatically reconnects to GraphQL API (instance 2).
8. Client renews the subscription with the previously acquired token.
9. The Worker finishes and publishes the results to the EventsQueue (RabbitMQ).
10. ALL of our GraphQL instances receive the ReportGenerationDoneEvent and check whether anybody is listening for its token.
11. GraphQL API (instance 2) sees that Client is awaiting the results and pushes them via WebSocket.
12. GraphQL API (instances 3-100) ignore the ReportGenerationDoneEvent.
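
The scenario above can be sketched roughly as follows. This is only an in-memory illustration, not our production code: CommandQueue and EventsBroker here are hypothetical stand-ins for the RabbitMQ queues, and ApiInstance, runWorker, the token format, and the reportUrl field are all invented for the example. The part that matters is that every API instance receives every event and only the instance holding a listener for the token pushes it on.

```typescript
// Hypothetical in-memory stand-ins for RabbitMQ; every name here is illustrative.
type Command = { token: string; kind: "GenerateReport" };
type DoneEvent = { token: string; type: "ReportGenerationDoneEvent"; reportUrl: string };
type Listener = (event: DoneEvent) => void;

// Work queue: each command is consumed by exactly one worker.
class CommandQueue {
  private items: Command[] = [];
  enqueue(cmd: Command): void { this.items.push(cmd); }
  dequeue(): Command | undefined { return this.items.shift(); }
}

// Fan-out: every attached API instance receives every event.
class EventsBroker {
  private handlers = new Set<Listener>();
  subscribe(handler: Listener): () => void {
    this.handlers.add(handler);
    return () => { this.handlers.delete(handler); };
  }
  publish(event: DoneEvent): void {
    Array.from(this.handlers).forEach((handler) => handler(event));
  }
}

let nextToken = 0;

class ApiInstance {
  // token -> push callbacks of clients connected to THIS instance only
  private listeners = new Map<string, Listener[]>();
  private commands: CommandQueue;
  private detach: () => void;

  constructor(commands: CommandQueue, events: EventsBroker) {
    this.commands = commands;
    this.detach = events.subscribe((event) => this.onEvent(event));
  }

  // Queue the command and hand the token back to the client.
  submitReportCommand(): string {
    const token = `cmd-${++nextToken}`;
    this.commands.enqueue({ token, kind: "GenerateReport" });
    return token;
  }

  // The client subscribes (or re-subscribes after a reconnect) with its token.
  subscribeToResult(token: string, push: Listener): void {
    const list = this.listeners.get(token) ?? [];
    list.push(push);
    this.listeners.set(token, list);
  }

  private onEvent(event: DoneEvent): void {
    const list = this.listeners.get(event.token);
    if (!list) return; // nobody on this instance is waiting: ignore the event
    this.listeners.delete(event.token);
    list.forEach((push) => push(event));
  }

  // Simulate a crash: the instance stops receiving events and loses its sockets.
  die(): void {
    this.detach();
    this.listeners.clear();
  }
}

// The worker drains the command queue and publishes one "done" event per command.
function runWorker(commands: CommandQueue, events: EventsBroker): void {
  let cmd = commands.dequeue();
  while (cmd !== undefined) {
    events.publish({
      token: cmd.token,
      type: "ReportGenerationDoneEvent",
      reportUrl: `/reports/${cmd.token}`,
    });
    cmd = commands.dequeue();
  }
}

const commands = new CommandQueue();
const events = new EventsBroker();
const api1 = new ApiInstance(commands, events);
const api2 = new ApiInstance(commands, events);
const api3 = new ApiInstance(commands, events);

const delivered: string[] = [];
const token = api1.submitReportCommand();
api1.subscribeToResult(token, (e) => delivered.push(`api-1:${e.token}`));
api1.die(); // instance 1 dies while the report is being generated
api2.subscribeToResult(token, (e) => delivered.push(`api-2:${e.token}`)); // client reconnects
api3.subscribeToResult("someone-elses-token", (e) => delivered.push(`api-3:${e.token}`));
runWorker(commands, events);
console.log(delivered); // a single entry: "api-2:cmd-1"
```

In a real deployment the broker would be a RabbitMQ exchange that fans events out to one queue per API instance, and the push callback would write to the client's WebSocket. Neither is shown here.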

It is fairly extensive, but with simple abstractions you do not have to think about all this complexity, and a new process using this route takes ~30 lines of code across several services.

And what is brilliant about it is that you end up with nice horizontal scaling, event replayability (retries), separation of concerns (client, API, workers), and data pushed to the client as quickly as possible. As you mentioned, you also do not waste bandwidth on "are we done yet?" requests.

Another cool thing is that whenever a user opens the reports list within our panel, they see the reports currently being generated and can subscribe to their changes, so they do not have to refresh the list manually.

Good thinking on SocketCluster. It would optimize step 10 in the above scenario, but for now we do not see any performance issues with broadcasting the ReportGenerationDoneEvent to the whole API cluster. With more instances or a multi-region architecture it would be a must, as it would allow for better scaling and sharding.

It is important to understand that SocketCluster operates at the communication layer (WebSockets), while the logical API layer (GraphQL) sits above it. To make a GraphQL Subscription work, you just need a communication protocol that lets you push information to the user, and WebSockets do that.
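
To make the layering concrete, here is a minimal sketch in which the subscription logic knows nothing about the transport. All names (PushTransport, SubscriptionRegistry, FakeWebSocket, FakeEventStream) are hypothetical and are not part of any GraphQL library or of SocketCluster. The second transport imitates server-sent events purely as an assumed example of another push-capable protocol.

```typescript
// Communication layer: anything that can push a string to a connected client.
interface PushTransport {
  send(payload: string): void;
}

// Logical layer: tracks who is subscribed to what, independent of the transport.
class SubscriptionRegistry {
  private byTopic = new Map<string, Set<PushTransport>>();

  // Returns an unsubscribe function for this transport and topic.
  subscribe(topic: string, transport: PushTransport): () => void {
    const set = this.byTopic.get(topic) ?? new Set<PushTransport>();
    set.add(transport);
    this.byTopic.set(topic, set);
    return () => { set.delete(transport); };
  }

  // Pushes to every subscriber of the topic and returns how many were reached.
  publish(topic: string, data: unknown): number {
    const set = this.byTopic.get(topic);
    if (!set) return 0;
    const payload = JSON.stringify({ topic, data });
    set.forEach((transport) => transport.send(payload));
    return set.size;
  }
}

// Fake transports that only record what they would have sent.
class FakeWebSocket implements PushTransport {
  frames: string[] = [];
  send(payload: string): void { this.frames.push(payload); }
}

class FakeEventStream implements PushTransport {
  chunks: string[] = [];
  send(payload: string): void { this.chunks.push(`data: ${payload}\n\n`); }
}

const registry = new SubscriptionRegistry();
const ws = new FakeWebSocket();
const sse = new FakeEventStream();

const unsubscribeWs = registry.subscribe("report:42", ws);
registry.subscribe("report:42", sse);

const reached = registry.publish("report:42", { status: "done" });
unsubscribeWs(); // the WebSocket client stops listening
const reachedAfter = registry.publish("report:42", { status: "archived" });

console.log(reached, reachedAfter, ws.frames.length, sse.chunks.length);
```

Swapping the transport (plain WebSockets, SocketCluster, or something else) changes only the PushTransport implementation. The registry, which plays the role of the subscription layer, stays the same.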

I think using SocketCluster is a good design choice, but remember to iterate on the implementation. Only use SocketCluster when you expect to have many sockets open at any single point in time. Also, subscribe only when necessary, because a WebSocket connection is stateful and requires management and heartbeats.

If you are further interested in the asynchronous backend architecture I described above, read up on the CQRS and Event Sourcing patterns.