
MPI_File_read_at line by line


I'm new to MPI. I'd like to use MPI_File_read_at() to read a text file line by line. The lines have different lengths, so when I read with a fixed buffer length, the read sometimes runs past the end of the line and puts part of the next line into the buffer, which causes problems. Is there any way to make MPI_File_read_at() stop when it reaches the "\n" at the end of each line? Or is there a better way to read data line by line with an MPI function other than MPI_File_read_at()?

In short, my question is how to use MPI_File_read_at() to do the same thing fgets does.

/*the traditional way to read a file line by line:*/
for(i=0;i<nlens;i++)
{
    fgets(line, 20, fp);
    sscanf(line,"%d%d",&e1,&e2);
}


/*the way I am doing this by MPI*/

int offset = 0;
int count = 15;
char line[15];
for(i=0;i<nlens;i++)
{
    offset=i*count;
    MPI_File_read_at(fp, offset, line, count, MPI_CHAR, &status);
    printf("line%d:\n%s\n",i,line);
}

/*So, if I have a file that looks like this:*/
0        2
0        44353
3        423
4        012312
5        2212
5        476

/*the output of the MPI version will be:*/
line0: 
0        2
0
line1:
      44353
3
line2:
      423
4
line3:
    012312
5
line4:
      2212
5
line5:
     476
2
5

void read_data(char address_data[], int num_edges, int link[][100])
{
    int i,e1,e2;
    FILE *fp;
    char line[30];

    int totalTaskNum, rankID;
    MPI_Comm_rank(MPI_COMM_WORLD, &rankID);
    MPI_Comm_size(MPI_COMM_WORLD, &totalTaskNum);
    MPI_Status status;
    if(rankID == 0)  /*only rank 0 reads, to avoid several processes accessing the file simultaneously*/
    {
        fp = fopen(address_data, "r");
        for(i=0;i<num_edges;i++)
        {
            fgets(line, 20, fp);
            sscanf(line,"%d%d",&e1,&e2);
            link[e1][e2]++;
        }
        fclose(fp);
    }
}

Solution

  • MPI I/O is best suited to binary files in which every record has a fixed size (for example, one struct per record), so you can loop over the records or compute the offset of any record directly.

    But with text files you can't know ahead of time where each line ends/starts.

    If you want to read a text file, you need to read its contents into a buffer and then scan the buffer character by character; whenever c == '\n', you know that the current line ends.
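A minimal sketch of that scan, assuming the file contents are already in memory (for example, filled by one MPI_File_read_at call that reads the whole file, or one chunk of it). The helper name next_line and its parameters are hypothetical, not part of MPI:

```c
#include <stddef.h>

/*
 * Copy the line that starts at buf[pos] into out (at most outsz-1 chars,
 * always NUL-terminated when outsz > 0, newline not included).
 * Returns the position just past the '\n', or len if the buffer ended first.
 * buf/len are assumed to hold bytes already read from the file.
 */
size_t next_line(const char *buf, size_t len, size_t pos,
                 char *out, size_t outsz)
{
    size_t n = 0;

    while (pos < len && buf[pos] != '\n') {
        if (n + 1 < outsz)
            out[n++] = buf[pos];   /* over-long lines are truncated */
        pos++;
    }
    if (outsz > 0)
        out[n] = '\0';

    return pos < len ? pos + 1 : len;   /* step over the '\n' itself */
}
```

Calling it in a loop, starting at pos = 0 and continuing while pos < len, yields one NUL-terminated line per call, which can then be parsed with sscanf just like the result of fgets.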

    If you only need to find a line once, read the file as I described and increment a counter whenever a line starts, so you know when you have reached the line you want.

    But if your program needs to look lines up multiple times (which is more likely), one idea is to read the file once and build an index of the positions of the lines, so you know where each line is. Of course, if you update the file afterwards, you need to update the index accordingly.
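The index idea could look roughly like this. It is only a sketch: build_line_index is a hypothetical helper, and it assumes the file has already been read into buf once.

```c
#include <stddef.h>

/*
 * Record the byte offset at which each line of buf begins.
 * Stores at most max offsets in starts[] and returns the total number of
 * lines found (which may exceed max). A final line without a trailing
 * '\n' still counts as a line.
 */
size_t build_line_index(const char *buf, size_t len,
                        size_t *starts, size_t max)
{
    size_t count = 0;
    size_t i;
    int at_line_start = 1;

    for (i = 0; i < len; i++) {
        if (at_line_start) {
            if (count < max)
                starts[count] = i;
            count++;
            at_line_start = 0;
        }
        if (buf[i] == '\n')
            at_line_start = 1;   /* the next byte, if any, begins a new line */
    }
    return count;
}
```

With the index in hand, line k begins at byte starts[k] and the next line begins at starts[k + 1], so those two values give the offset and the count to pass to MPI_File_read_at when a process needs exactly that line.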

    Edit: you can also take a look at this answer. It has helped me in the past.

    Edit 2: here is what you asked for. The master process reads the file into a 2D int array.

    I created a file matrix.txt containing these space-separated values:

    0 2
    0 44353
    3 423
    4 012312
    5 2212
    5 476
    

    Here is the program:

    #include <mpi.h>
    #include <stdio.h>
    
    #define FILENAME    "matrix.txt"
    #define WIDTH       2
    #define HEIGHT      6
    
    
    int main() {
    
        int rank, world_size;
    
        MPI_Init(NULL, NULL);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    
    
        int matrix[HEIGHT][WIDTH];
    
        /**
         * The MASTER process reads the file
         */
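        if (rank == 0) {
            FILE *fp = fopen(FILENAME, "r");
            if (fp == NULL) {
                perror(FILENAME);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
    
            char line[100];
            int i = 0;
            /* stop after HEIGHT rows so a longer file cannot overflow matrix */
            while (i < HEIGHT && fgets(line, sizeof line, fp)) {
                /* only keep lines that contain two integers */
                if (sscanf(line, "%d %d", &matrix[i][0], &matrix[i][1]) == 2)
                    i++;
            }
            fclose(fp);
        }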
    
        //Broadcast it to all processes
        MPI_Bcast(matrix, HEIGHT * WIDTH, MPI_INT, 0, MPI_COMM_WORLD);
    
        //just for demo purposes each process prints its array orderly
        int p = 0, i = 0, j = 0;
        for (p = 0; p < world_size; p++) {
            MPI_Barrier(MPI_COMM_WORLD);
            if (p == rank) {
                printf("----------\n proc:%d received:\n", rank);
                for (i = 0; i < HEIGHT; i++) {
                    for (j = 0; j < WIDTH; j++) {
                        printf("%d\t", matrix[i][j]);
                    }
                    printf("\n");
                }
            }
        }
    
    
        MPI_Finalize();
        return 0;
    }
    

    I made a few assumptions, for example that you know the size of the matrix in the file in advance. If you want to make your program more dynamic, I leave that up to you. After reading the file, I broadcast the matrix to all processes so you can test that it works. Since you say you want to do matrix multiplication, you will need a decomposition strategy, using MPI_Scatter or something similar, so that each process receives a different chunk.

    Here is the output:

    $ mpirun -np 4 ./program 
    ----------
     proc:0 received:
    0       2       
    0       44353   
    3       423     
    4       12312   
    5       2212    
    5       476     
    ----------
     proc:1 received:
    0       2       
    0       44353   
    3       423     
    4       12312   
    5       2212    
    5       476     
    ----------
     proc:2 received:
    0       2       
    0       44353   
    3       423     
    4       12312   
    5       2212    
    5       476     
    ----------
     proc:3 received:
    0       2       
    0       44353   
    3       423     
    4       12312   
    5       2212    
    5       476
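
For the decomposition step mentioned above, one possible starting point is to work out how many rows each process gets. This is only a sketch under my own assumptions: partition_rows is a hypothetical helper, and it splits the rows as evenly as possible, giving the first few ranks one extra row when the row count does not divide evenly. The two arrays it fills are in the form that MPI_Scatterv expects for its sendcounts and displs arguments (both measured in elements, here ints).

```c
/*
 * Split `rows` rows of `width` ints across `nprocs` processes as evenly
 * as possible. counts[p] is the number of ints sent to rank p and
 * displs[p] is the offset (in ints) of that rank's first element.
 */
void partition_rows(int rows, int width, int nprocs,
                    int *counts, int *displs)
{
    int base = rows / nprocs;    /* rows every rank gets */
    int extra = rows % nprocs;   /* leftover rows, one each to the first ranks */
    int offset = 0;
    int p;

    for (p = 0; p < nprocs; p++) {
        int r = base + (p < extra ? 1 : 0);
        counts[p] = r * width;
        displs[p] = offset;
        offset += counts[p];
    }
}
```

For the 6 x 2 matrix above and 4 processes, ranks 0 and 1 would each receive two rows and ranks 2 and 3 one row each.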