akka, distributed-computing

AKKA: passing local resource to actor


Suppose I want to use the Akka actor model to build a program that crunches data coming from files. As far as I understand, the model works best when actors are unaware of where they are running, so passing the path of the file in a message would be a mistake: once the app scales out, some actors may not have access to that path. On the other hand, passing the entire file as bytes does not seem to be an option because of resource limits (what if the files get bigger and bigger?). What is the correct strategy for handling this situation? A related question: would assuming a distributed file system be a good enough reason to accept paths in messages?


Solution

  • I don't think there's a single definitive answer, because it depends on the nature of the data and the "crunching". However, in the typical case where you really are doing data processing of the files, you are going to have to read the files into memory at some point. So yes, the general answer is to read the entire file as bytes.

    As for "what if the file is bigger", that's why we have streaming libraries like Akka Streams. For example, a common approach is to use Alpakka to watch for files in a local directory (or on FTP), parse them into records, filter/map the records to do some initial cleansing, and then stream those records to distributed actors for processing. Because you are streaming, Akka never tries to load the whole file into memory at once, and you get the benefit of backpressure, so you don't overload the actors doing the processing.

    That's not to say a distributed file system has no uses; high availability is one. If you upload a file to the local filesystem of an Akka node and that node fails, you obviously lose access to the file. But that's really a separate issue from how you do distributed processing.
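
The streaming approach described above could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation. It assumes Akka 2.6+ with the `akka-actor-typed`, `akka-stream` and `akka-stream-typed` modules and Scala 2.13. The file name `records.csv`, the `key,value` line format, and the `Record`, `Crunch` and `Ack` types are all hypothetical, invented for the example. The worker here is a single local actor; in a cluster the same flow would target whatever `ActorRef` fronts your remote workers.

```scala
import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.stream.typed.scaladsl.ActorFlow
import akka.util.{ByteString, Timeout}

import java.nio.file.Paths
import scala.concurrent.duration._

object FileCrunch {
  // Hypothetical record type and worker protocol, invented for this sketch.
  final case class Record(key: String, value: Long)
  final case class Crunch(record: Record, replyTo: ActorRef[Ack])
  final case class Ack(key: String)

  // Parse one "key,value" line; malformed lines yield None and are dropped later.
  def parseRecord(line: String): Option[Record] =
    line.split(',') match {
      case Array(k, v) => v.trim.toLongOption.map(n => Record(k.trim, n))
      case _           => None
    }

  // Worker actor: receives a parsed record, never a path or a whole file.
  val worker: Behavior[Crunch] = Behaviors.receiveMessage[Crunch] {
    case Crunch(record, replyTo) =>
      // Real processing would go here. The reply tells the stream that
      // this worker is ready for more, which is what drives backpressure.
      replyTo ! Ack(record.key)
      Behaviors.same
  }

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem[Crunch] = ActorSystem(worker, "crunch")
    implicit val timeout: Timeout = Timeout(5.seconds)

    FileIO
      .fromPath(Paths.get("records.csv")) // hypothetical path; read chunk by chunk
      .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 4096, allowTruncation = true))
      .map(_.utf8String)
      .mapConcat(line => parseRecord(line).toList) // initial cleansing: drop bad lines
      // At most 4 records are in flight; the file is read no faster than acks arrive.
      .via(ActorFlow.ask[Record, Crunch, Ack](parallelism = 4)(system)(
        (record, replyTo) => Crunch(record, replyTo)))
      .runWith(Sink.ignore)
      .onComplete(_ => system.terminate())(system.executionContext)
  }
}
```

Only the node that runs the stream needs access to the file. The actors see individual records, so the messages stay small and location-independent whatever the file size.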