Tags: c, mpi, stdout, distributed-computing

Suggestion: distributed computing, getting data from a stream


In my software, process 1 reads information from a stream X (the stdout of another process), then sends the information to the other N-1 processes, and finally collects in process 1 all the data processed by the N processes.

Now my question is: "What's the most efficient way to share the information read from the stream between processes?"

PS. The processes may also be on different computers connected through a network.

Here I list some possibilities:

  1. Count the lines of the stream (M lines), save M/N lines to each of N files, and send one file to each process.
  2. Count the lines of the stream (M lines), allocate enough memory to hold all the information, and send the information to each process directly.

But I think these approaches have some problems:

  1. Writing so many files adds overhead, and sending files over a network isn't efficient at all.
  2. I need enough memory in process 1, so that process can become a bottleneck.

What do you suggest? Do you have better ideas? I'm using MPI in C for this computation.


Solution

  • Using files is just fine if performance is not an issue. The advantage is that you keep everything modular, with the files as a decoupled interface. You can even use very simple command-line tools:

    ./YOUR_COMMAND > SPLIT_ALL
    split -n l/$N -d SPLIT_ALL SPLIT_FILES
    

    Set N in your shell or replace it appropriately. Note: unfortunately you cannot pipe directly into split in this case, because split cannot determine the total number of lines when reading from stdin. If a round-robin rather than a contiguous split is fine, you can pipe directly:

    ./YOUR_COMMAND | split -n r/$N -d - SPLIT_FILES
    

    Your second solution is also fine - if you have enough memory. Keep in mind to use the appropriate collective operations, e.g. MPI_Scatter or MPI_Scatterv for sending, and MPI_Gather or MPI_Reduce for collecting the data from the workers.
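    Since the lines rarely divide evenly among the ranks, MPI_Scatterv needs per-rank counts and displacements. A minimal sketch of that bookkeeping (plain C, no MPI calls; the helper name `split_counts` is illustrative, not part of MPI):

    ```c
    #include <stdio.h>

    /* Fill counts[] and displs[] for MPI_Scatterv, splitting m items
     * as evenly as possible among n ranks: the first m % n ranks get
     * one extra item each. */
    static void split_counts(int m, int n, int *counts, int *displs)
    {
        int base = m / n, rem = m % n, offset = 0;
        for (int i = 0; i < n; i++) {
            counts[i] = base + (i < rem ? 1 : 0);
            displs[i] = offset;
            offset += counts[i];
        }
    }
    ```

    For example, 10 lines over 3 ranks yields counts {4, 3, 3} and displacements {0, 4, 7}; these arrays are then passed straight to MPI_Scatterv on the root.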

    If you run out of memory, then buffer the input in chunks (of, for instance, 100,000 lines), scatter each chunk to your workers, compute, collect the results, and repeat.
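    The buffering step on rank 0 might look like the sketch below (the chunk size and the helper name `read_chunk` are assumptions; the MPI scatter/gather calls are elided as comments since they depend on your data layout):

    ```c
    #include <stdio.h>

    #define CHUNK_LINES 100000  /* tune to the memory available on rank 0 */
    #define MAX_LINE 1024

    /* Read up to max_lines lines from `in` into buf; return the number
     * of lines actually read (0 at end of stream). Rank 0 calls this in
     * a loop: after each chunk it would flatten buf, MPI_Scatterv it to
     * the workers, and MPI_Gatherv the results, until 0 is returned. */
    static int read_chunk(FILE *in, char buf[][MAX_LINE], int max_lines)
    {
        int n = 0;
        while (n < max_lines && fgets(buf[n], MAX_LINE, in) != NULL)
            n++;
        return n;
    }
    ```

    This keeps the peak memory on rank 0 bounded by the chunk size rather than by the whole stream, at the cost of one scatter/gather round per chunk.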