In my software I read information from a stream X (the stdout of another process) in process 1, then I send the information read to the other N-1 processes, and finally I collect in process 1 all the data processed by the N processes.
Now my question is: "What's the most efficient way to share the information read from the stream between processes?"
PS. Processes may also be on different computers connected through a network.
Here I list some possibilities:
But I think that these may have some problems:
What do you suggest? Do you have better ideas? I'm using MPI in C for this computation.
Using files is just fine if performance is not an issue. The advantage is that you keep everything modular, with the files as a decoupled interface. You can even use very simple command line tools:
./YOUR_COMMAND > SPLIT_ALL
split -n l/"$N" -d SPLIT_ALL SPLIT_FILES
Set N in your shell or replace it appropriately.
Note: unfortunately you cannot pipe directly into split in this case, because it then cannot determine the total number of lines when reading from stdin. If a round-robin rather than a contiguous split is fine, you can pipe directly:
./YOUR_COMMAND | split -n r/"$N" -d - SPLIT_FILES
Your second solution is also fine, if you have enough memory. Keep in mind to use the appropriate collective operations, e.g. MPI_Scatter(v) for sending, and MPI_Gather or MPI_Reduce for receiving the data from the clients.
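A minimal sketch of that scatter/compute/gather pattern, assuming the stream has already been parsed into an array of doubles on rank 0 (the data type, the total of 1000 elements, and the doubling step are placeholders for your own format and computation):

/* Rank 0 owns the full input, MPI_Scatterv hands each rank an
 * (uneven) slice, every rank processes its slice, and MPI_Gatherv
 * collects the results back on rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int total = 1000;                 /* pretend this many values were read from the stream */
    double *input = NULL;
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));

    /* how many elements each rank gets (handles total % size != 0) */
    for (int i = 0, offset = 0; i < size; ++i) {
        counts[i] = total / size + (i < total % size ? 1 : 0);
        displs[i] = offset;
        offset += counts[i];
    }

    if (rank == 0) {
        input = malloc(total * sizeof(double));
        for (int i = 0; i < total; ++i)
            input[i] = (double)i;     /* stand-in for the parsed stream data */
    }

    double *chunk = malloc(counts[rank] * sizeof(double));
    MPI_Scatterv(input, counts, displs, MPI_DOUBLE,
                 chunk, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < counts[rank]; ++i)
        chunk[i] *= 2.0;              /* placeholder for the real computation */

    double *result = (rank == 0) ? malloc(total * sizeof(double)) : NULL;
    MPI_Gatherv(chunk, counts[rank], MPI_DOUBLE,
                result, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("first result: %f, last result: %f\n", result[0], result[total - 1]);

    free(chunk); free(counts); free(displs);
    if (rank == 0) { free(input); free(result); }
    MPI_Finalize();
    return 0;
}

If the final answer is something like a sum rather than the full transformed data, MPI_Reduce on each rank's partial result is cheaper than gathering everything back.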
If you run out of memory, then buffer the input in chunks (of, for instance, 100,000 lines), scatter each chunk to your workers, compute, collect the result, and repeat (see the loop skeleton below).
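A rough loop skeleton for that chunked variant, reusing the counts, displs, chunk and result buffers from the sketch above; read_next_chunk, buffer and CHUNK are hypothetical names for your own stream-reading code:

/* Rank 0 reads up to CHUNK elements per round; a broadcast tells the
 * workers how much data this round carries (0 = end of stream), then
 * the same scatter/compute/gather pattern runs on just that chunk. */
while (1) {
    int n = 0;
    if (rank == 0)
        n = read_next_chunk(buffer, CHUNK);   /* returns 0 at end of stream */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (n == 0)
        break;                                /* all ranks leave the loop together */

    /* recompute counts/displs for n elements, then: */
    MPI_Scatterv(buffer, counts, displs, MPI_DOUBLE,
                 chunk, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* ... compute on chunk ... */
    MPI_Gatherv(chunk, counts[rank], MPI_DOUBLE,
                result, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}

The broadcast of the chunk size is what keeps all ranks in lockstep, so nobody blocks in a collective that the others never enter.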