Let's say I have N files in a format like this:
One file looks like this:
For each time there is some amount of data, each with a different id.
- time 1:
- data with id: 10
- data with id: 13
- data with id: 4
- time 2:
- data with id: 10
- data with id: 77
...etc
(For each time, the data with ids 1-1000 are spread somehow (mixed) over these N files.)
I would like to combine all these N files into a single file which is ordered:
Final File:
- time 1:
- data with id: 1
- data with id: 2
- data with id: 3
- ...
- data with id: 1000
- time 2:
- data with id: 1
- data with id: 2
- data with id: 3
- ...
- data with id: 1000
...etc
The data for ids 1-1000 at a single time is approximately 100 MB, but I have a lot of times, which adds up to as much as 50 GB of data.
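For concreteness, I think of each entry as a fixed-size record, roughly like this (just an illustration; the exact payload layout doesn't matter):

    /* hypothetical record: ~100 KB of payload per id, so ids 1-1000
       together come to about 100 MB per time */
    typedef struct {
        int id;                 /* 1..1000 */
        char payload[100*1024];
    } item;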
My solution so far, to make this as fast as possible, looks like this:
I use T threads on a supercomputer node (one machine with e.g. 24-48 cores). I have allocated a shared memory array that holds all data with ids 1-1000 for one time (it could also hold more if I wanted).
Procedure:
Step 1: The threads read the N files in parallel, and all data for the current time is sent to one rank, which sorts it by id into the shared memory array (see the sketch below).
Step 2: Once the array for the current time is complete, it is written to the final file in parallel with MPI-IO, and we continue with the next time.
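In MPI terms, my step 1 is basically a gather to one rank. A rough sketch of what I have in mind (my_items, my_nbytes, shared_array, rank, and nranks stand in for my real buffers and process bookkeeping):

    /* step 1: each rank packs its items for the current time into
       my_items (my_nbytes bytes); rank 0 collects everything into
       the shared array */
    int *counts = malloc(nranks * sizeof(int));
    int *displs = malloc(nranks * sizeof(int));
    MPI_Gather(&my_nbytes, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        int off = 0;
        for (int r = 0; r < nranks; r++) { displs[r] = off; off += counts[r]; }
    }
    MPI_Gatherv(my_items, my_nbytes, MPI_BYTE,
                shared_array, counts, displs, MPI_BYTE, 0, MPI_COMM_WORLD);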
Thanks for any inputs!
Your approach will probably work OK for a moderate amount of data, but you've made one rank the central point of communication here. That's not going to scale terribly well.
You're on the right track with your part 2: a parallel write using MPI-IO sounds like a good approach to me. Here's how that might go:
First, does every process know the maximum ID and the maximum number of timesteps? That is, if one process holds
    data with id: 4
does it know that some other processes have ids 1, 2, 3, and 5? If so, then every process knows where its data has to go. If you don't know the max ID and the max timesteps, you'd have to do a bit of communication (MPI_Allreduce with MPI_MAX as the operation) to find that.
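That reduction is a one-liner. A minimal sketch, assuming each process tracks its local maxima in local_max_id and local_max_time:

    /* every process contributes its local maxima and receives the
       global maxima */
    int local_max[2] = { local_max_id, local_max_time };
    int global_max[2];
    MPI_Allreduce(local_max, global_max, 2, MPI_INT, MPI_MAX, MPI_COMM_WORLD);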
With these preliminaries, you can set an MPI-IO "file view", probably using MPI_Type_indexed.
On rank 0, this gets a bit more complicated because you need to add the timestep markers to your list of data. Or, you can define a file format with an index of timesteps, and store that index in a header or footer.
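For the footer variant, a sketch might look like this (data_end_offset, time_offsets, and ntimes are assumptions: the byte length of the data section and an array holding each timestep's starting offset):

    /* reset the view so offsets are plain bytes again, then have
       rank 0 append the timestep index after the data section */
    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
    if (rank == 0)
        MPI_File_write_at(fh, data_end_offset, time_offsets,
                          ntimes, MPI_OFFSET, MPI_STATUS_IGNORE);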
The code would look roughly like this:
    for (i = 0; i < nitems; i++) {
        datalen[i] = sizeof(item);
        /* index_of_item: this item's global slot in the file,
           e.g. timestep*max_id + (id-1) */
        offsets[i] = sizeof(item)*index_of_item;
    }
    MPI_Type_indexed(nitems, datalen, offsets, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);   /* commit before using the type */
    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buffer, nitems*sizeof(item), MPI_BYTE, &status);
The _all bit here is important: you're going to create a highly non-contiguous, irregular access pattern from each MPI process. Give the MPI-IO library a chance to optimize that request.
Also, it's important to note that the offsets in an MPI-IO file view must be monotonically non-decreasing, so you'll have to sort the items locally before writing the data out collectively. Local memory operations have an insignificant cost relative to an I/O operation, so this usually isn't a big deal.
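The local sort can be a plain qsort over each item's target slot, done before you fill in datalen/offsets. A sketch, assuming each item carries a slot field with its global position in the file:

    #include <stdlib.h>

    /* comparator: order items by their target slot in the final file */
    static int compare_slot(const void *a, const void *b) {
        const item *x = a, *y = b;
        return (x->slot > y->slot) - (x->slot < y->slot);
    }

    /* sort locally, then build datalen[]/offsets[] from the sorted order */
    qsort(items, nitems, sizeof(item), compare_slot);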