Steps to improve throughput of Node JS server application

I have a very simple nodejs application that accepts json data (1KB approx.) via POST request body. The response is sent back immediately to the client and the json is posted asynchronously to an Apache Kafka queue. The number of simultaneous requests can go as high as 10000 per second which we are simulating using Apache Jmeter running on three different machines. The target is to achieve an average throughput of less than one second with no failed requests.

On a 4 core machine, the app handles upto 4015 requests per second without any failures. However since the target is 10000 requests per second, we deployed the node app in a clustered environment.

Both clustering in the same machine and clustering between two different machines (as described here) were implemented. Nginx was used as a load balancer to round robin the incoming requests between the two node instances. We expected a significant improvement in the throughput (like documented here) but the results were on the contrary. The number of successful requests dropped to around 3100 requests per second.

My questions are:

What could have gone wrong in the clustered approach?
Is this even the right way to increase the throughput of Node application?
We also did a similar exercise with a java web application in Tomcat container and it performed as expected 4000 requests with a single instance and around 5000 successful requests in a cluster with two instances. This is in contradiction to our belief that nodejs performs better than a Tomcat. Is tomcat generally better because of its thread per request model?

Thanks a lot in advance.

Solution

Per your request, I'll put my comments into an answer:

Clustering is generally the right approach, but whether or not it helps depends upon where your bottleneck is. You will need to do some measuring and some experiments to determine that. If you are CPU-bound and running on a multi-core computer, then clustering should help significantly. I wonder if your bottleneck is something besides CPU such as networking or other shared I/O or even Nginx? If that's the case, then you need to fix that before you would see the benefits of clustering.

Is tomcat generally better because of its thread per request model?

No. That's not a good generalization. If you are CPU-bound, then threading can help (and so can clustering with nodejs). But, if you are I/O bound, then threads are often more expensive than async I/O like nodejs because of the resource overhead of the threads themselves and the overhead of context switching between threads. Many apps are I/O bound which is one of the reasons node.js can be a very good choice for server design.

I forgot to mention that for http, we are using express instead of the native http provided by node. Hope it does not introduce an overhead to the request handling?

Express is very efficient and should not be the source of any of your issues.