I'm running a simulation with thousands of MPI processes and need to write output data to a small set of files. For example, even though I might have 10,000 processes, I only want to write out 10 files, with 1,000 processes writing to each one (at some appropriate offset). AFAIK the correct way to do this is to create a new communicator for each group of processes that will write to the same file, open a shared file for that communicator with MPI_File_open(), and then write to it with MPI_File_write_at_all(). Is that correct? The following code is a toy example that I wrote up:
#include <mpi.h>
#include <math.h>
#include <stdio.h>

const int MAX_NUM_FILES = 4;

int main(void) {
    MPI_Init(NULL, NULL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    /* Assign each rank to a file; the last group may be smaller
     * if numProcs is not divisible by MAX_NUM_FILES. */
    int numProcsPerFile = ceil(((double) numProcs) / MAX_NUM_FILES);
    int targetFile = rank / numProcsPerFile;

    /* Ranks with the same color (targetFile) end up in the same
     * sub-communicator, ordered by their world rank. */
    MPI_Comm fileComm;
    MPI_Comm_split(MPI_COMM_WORLD, targetFile, rank, &fileComm);
    int targetFileRank;
    MPI_Comm_rank(fileComm, &targetFileRank);

    char filename[20]; // Sufficient for testing purposes
    snprintf(filename, 20, "out_%d.dat", targetFile);
    printf(
        "Proc %d: writing to file %s with rank %d\n", rank, filename,
        targetFileRank);

    /* Collective open and write over the sub-communicator only. */
    MPI_File outFile;
    MPI_File_open(
        fileComm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);

    char bufToWrite[4];
    snprintf(bufToWrite, 4, "%3d", rank);
    MPI_File_write_at_all(
        outFile, targetFileRank * 3,
        bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&outFile);
    MPI_Comm_free(&fileComm);
    MPI_Finalize();
}
I can compile with mpicc file.c -lm and run, say, 20 processes with mpirun -np 20 a.out, and I get the expected output (four files with five entries each), but I'm unsure whether this is the technically correct/optimal way of doing it. Is there anything I should do differently?
Your approach is correct. To see why, it helps to revisit the standard's definitions. The MPI_File_open API, from MPI: A Message-Passing Interface Standard, Version 2.2 (page 391):
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh)
Description:
MPI_FILE_OPEN opens the file identified by the file name filename on all processes in the comm communicator group. MPI_FILE_OPEN is a collective routine: all processes must provide the same value for amode, and all processes must provide filenames that reference the same file. (Values for info may vary.) comm must be an intracommunicator; it is erroneous to pass an intercommunicator to MPI_FILE_OPEN.
intracommunicator vs intercommunicator (page 134):
For the purposes of this chapter, it is sufficient to know that there are two types of communicators: intra-communicators and inter-communicators. An intracommunicator can be thought of as an identifier for a single group of processes linked with a context. An intercommunicator identifies two distinct groups of processes linked with a context.
The point of passing an intracommunicator to MPI_File_open() is to specify the set of processes that will perform operations on the file. The MPI runtime needs this information so it can enforce the appropriate synchronizations when collective I/O operations occur. It is the programmer's responsibility to understand the logic of the application and to create/choose the correct intracommunicators.
MPI_Comm_split() is a powerful API that splits a communicator's group of processes into disjoint subgroups, which can then be used for different purposes, including MPI I/O.