I have a question regarding pipeToSelf vs Futures.
Suppose we have a Scala backend application that runs on an n2d-standard-16 Google Cloud machine type with 16 cores.
Cluster details: each node hosts a single pod of the application. We use context.pipeToSelf to retrieve data from a distributed cache and then let the actor return the result to the sender.
Question:
In such a system, which requires high throughput and low latency, is there any overhead to using context.pipeToSelf(readFromCache) instead of using the Future monad's capabilities, e.g.
readFromCache.map { reply => ... }.recover { case NonFatal(ex) => ... }
with context.executionContext as the executor?
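Spelled out, the plain-Future shape I have in mind looks something like this (readFromCache is simulated with an immediately completed future, and the global execution context stands in for context.executionContext):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object FutureSketch {
  // Stand-in for the actor's context.executionContext.
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for the distributed-cache read.
  def readFromCache(key: String): Future[String] =
    Future(s"value-for-$key")

  // The reply is produced entirely by callbacks on the execution context,
  // without going through any actor mailbox.
  def run(key: String): String = {
    val reply = readFromCache(key)
      .map(value => s"OK: $value")
      .recover { case NonFatal(ex) => s"ERR: ${ex.getMessage}" }
    Await.result(reply, 5.seconds)
  }
}
```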
My concerns are:
Is there a benchmark available measuring the extra cost of inserting a message into the mailbox and retrieving it later? Additionally, is it possible to obtain a reference to the specific class that extends the ActorRef trait and implements the tell method?
Doesn't the internal processing time increase? If we process the first message and then send the result back to ourselves, the mailbox may already contain other messages, which could increase latency due to the FIFO nature of the mailbox. Those messages can come from other HTTP requests. With a Future, the map and recover run on the same thread and return the answer before other messages are handled.
Due to the default configuration of throughput = 5, it is possible that all available threads are currently in use. If a context switch between threads is needed when the answer from the distributed cache has to be returned, because pipeToSelf puts an additional message in the mailbox, there may be a delay waiting for another thread to become available.
is there any overhead with using context.pipeToSelf(readFromCache) instead of using Future monad capabilities?
Absolutely there is more overhead. Actors do a lot more than just handle async processing.
Does that overhead matter to you? Hard to tell without testing. But I (and everyone I know) generally have the rule that if something can be done as easily in a Future as in an actor, then you should do it in a Future. Actors should be used only when you actually need the features of actors.
EDIT: Also, to be clear, my understanding of the question was "should I use actors or should I use monad-based Futures", not "should I use pipeToSelf with an Actor or Futures". You (normally) can't just use a Future within an actor directly, for the reasons explained in the pipeToSelf documentation.
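For concreteness, here is a sketch of the safe pipeToSelf pattern in Akka Typed (assumes Akka is on the classpath; CacheReader, CacheHit/CacheMiss and the readFromCache parameter are illustrative names, not from the original post):

```scala
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import scala.concurrent.Future
import scala.util.{Failure, Success}

object CacheReader {
  sealed trait Command
  final case class Read(key: String, replyTo: ActorRef[String]) extends Command
  // Internal messages the Future's result is adapted into:
  private final case class CacheHit(value: String, replyTo: ActorRef[String]) extends Command
  private final case class CacheMiss(ex: Throwable, replyTo: ActorRef[String]) extends Command

  def apply(readFromCache: String => Future[String]): Behavior[Command] =
    Behaviors.receive { (context, message) =>
      message match {
        case Read(key, replyTo) =>
          // The Future completes on its own execution context; pipeToSelf
          // re-enters the result as a message, so actor state is only ever
          // touched from the actor's own message-processing thread.
          context.pipeToSelf(readFromCache(key)) {
            case Success(value) => CacheHit(value, replyTo)
            case Failure(ex)    => CacheMiss(ex, replyTo)
          }
          Behaviors.same
        case CacheHit(value, replyTo) =>
          replyTo ! value
          Behaviors.same
        case CacheMiss(ex, replyTo) =>
          replyTo ! s"error: ${ex.getMessage}"
          Behaviors.same
      }
    }
}
```

The alternative, calling replyTo ! value inside a Future.map callback, works only if the callback touches no actor state at all; pipeToSelf removes that foot-gun at the cost of one extra mailbox round trip.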
Is there a benchmark available regarding the amount of extra effort required to insert a message into the mailbox and retrieve it later?
I strongly doubt that there are benchmarks for that particular metric. At its heart, the mailbox is just a queue. The implementation of that queue can vary slightly based on your actor configuration, but it's still a queue and likely a very simple one. Head/tail operations on queues are absurdly fast. Again, I'm not pretending there's no overhead with Actors in general, but mailbox operations are never my concern.
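To build intuition for that cost (this is a rough illustration, not a substitute for a JMH benchmark), here is a stdlib-only sketch of enqueue/dequeue on a ConcurrentLinkedQueue, the lock-free queue that backs Akka's default unbounded mailbox:

```scala
import java.util.concurrent.ConcurrentLinkedQueue

object MailboxCost {
  // Pushes n items through the queue and checks FIFO order, returning
  // (order preserved, rough nanoseconds per enqueue/dequeue operation).
  def run(n: Int): (Boolean, Double) = {
    val q = new ConcurrentLinkedQueue[Integer]()
    val start = System.nanoTime()
    var i = 0
    while (i < n) { q.offer(i); i += 1 }
    var fifo = true
    i = 0
    while (i < n) { if (q.poll().intValue != i) fifo = false; i += 1 }
    val nanosPerOp = (System.nanoTime() - start).toDouble / (2.0 * n)
    (fifo, nanosPerOp)
  }
}
```

On typical hardware this lands in the tens of nanoseconds per operation, which is why enqueue/dequeue itself is rarely the bottleneck.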
I don't want to be dismissive. 200K requests per second at 1ms is serious business. Fundamentally you'll have to test and measure everything. But enqueue/dequeue time is not my instinctive big concern.
There are some broader microbenchmarks you could take a look at, although I still generally feel that the only real way to predict a real application's performance is to test that application's performance.
Doesn't the internal processing time increase? If we process the first message and then send the result back to ourselves, the mailbox may already contain other messages, which could increase latency due to the FIFO nature of the mailbox. Those messages can come from other HTTP requests. With a Future, the map and recover run on the same thread and return the answer before other messages are handled.
In theory, yes. In reality this is not a real concern unless you mess up and have blocking operations in a non-blocking dispatcher.
Of course, there's no real way of knowing without testing, especially at 200K per second. But the reality of most real world scenarios is that you are either keeping up with the load or you are not keeping up with the load. If you are keeping up with your load the mean size of a non-blocking mailbox will be close to zero. There are definitely exceptions, especially around blocking mailboxes, but there are people I know that run real time systems on Akka; 1ms latency is attainable.
Another way to think of it is that the real overhead and latency is in the context switch to the thread. Even under the unlikely scenario that your mailbox has gotten "long" (say a few dozen messages), it really isn't going to take long to process all of them. The latency is in waiting to get scheduled, not in the processing of the mailbox. Which leads to your next question.
Due to the default configuration of throughput = 5, it is possible that all available threads are currently in use. If a context switch between threads is needed when the answer from the distributed cache has to be returned, because pipeToSelf puts an additional message in the mailbox, there may be a delay waiting for another thread to become available.
The important config for what you are talking about is parallelism, which, as you say, will result in 16 threads (by default). And, yes, as I stated above, it's absolutely possible you might get a message received by a mailbox and not have a thread available to process that message right away.
This is definitely something you should pay attention to if you want sub-1ms latency. But, on the other hand, the reality is that you only have 16 cores. So if you had twice as many threads you could more easily get a thread, but the extra threads would be useless because they would have a hard time fighting for physical CPU time. In fact, mailboxes might actually help you here, because the batching would reduce context switching.
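For reference, these are the settings being discussed, with Akka 2.6's default values (on a 16-core box, parallelism-factor = 1.0 yields the 16 threads mentioned above):

```hocon
akka.actor.default-dispatcher {
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 8
    parallelism-factor = 1.0
    parallelism-max = 64
  }
  # Max messages an actor processes per scheduling turn before the thread
  # returns to the pool. Larger values improve batching and cache locality;
  # smaller values improve fairness across actors.
  throughput = 5
}
```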
Fundamentally, if you choose to go with Akka, the new licensing terms require you to have a contract with Lightbend anyway. Get them engaged and run the ideas by them in detail. Even if you aren't sure you want to go with Akka, talk to Lightbend, as they may be able to give you some specific customer examples.