c++ scanf mmap

Reading a file of float numbers with mmap() in C++


I'm trying to read a file consisting of 100000000 float numbers such as 0.12345678 or -0.1234567, separated by spaces, in C++. I used fscanf() to read the file, and the code looks like this:

FILE *fid = fopen("testingfile.txt", "r");
if (fid == NULL)
    return false;

float v;

for (int i = 0; i < 100000000; i++)
    if (fscanf(fid, "%f", &v) != 1)  // stop on EOF or malformed input
        break;

fclose(fid);

The file is 1199999988 bytes in size, and reading it with fscanf() took around 18 seconds. I therefore tried mmap() to speed up the reading, with code like this:

#define FILEPATH "testingfile.txt"

char text[32] = {'\0'};  // room for the sign, the digits, and the terminating '\0'
struct stat s;
int status = stat(FILEPATH, &s);
int fd = open(FILEPATH, O_RDONLY);
if (fd == -1)
{
    perror("Error opening file for reading");
    return 0;
}

char *map = (char *)mmap(NULL, s.st_size, PROT_READ, MAP_SHARED, fd, 0);
close(fd);

if (map == MAP_FAILED)
{
    perror("Error mmapping the file");
    return 0;
}

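size_t j = 0;  // number of characters collected for the current token
for (off_t i = 0; i < s.st_size; i++)
{
    if (isspace((unsigned char)map[i]))
    {
        if (j > 0)  // a token just ended; skip repeated whitespace
        {
            text[j] = '\0';
            float v = atof(text);
            j = 0;
        }
        continue;
    }
    if (j < sizeof(text) - 1)  // never write past the end of the buffer
        text[j++] = map[i];
}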
if (munmap(map, s.st_size) == -1)
{
    return 0;
}

However, it still takes around 14.5 seconds to finish reading. I found that the most time-consuming part is converting the char array to float, which takes around 10 seconds.

So I have three questions:

  1. Is there any way I can read floats directly instead of chars, or

  2. Is there a better method to convert a char array to float?

  3. How does fscanf recognize and read a floating-point value, given that it seems much faster than atof()?

Thanks in advance!


Solution

  • Based on the advice given, here are two possible solutions to this problem:

    The first approach is a bit "stupid". Since the format of the stored floating-point values is known, the conversion from char array to float can easily be done without using atof(). With atof() removed, reading and converting the same file takes only 8 seconds.

    The second approach is to change the storage format of the float numbers in the file (as advised by Jeremy Friesner). The floating-point values are stored in binary format, so the conversion step after mmap() is no longer required. The code becomes something like this:

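    #include <cstdio>      // perror
    #include <fcntl.h>     // open
    #include <sys/mman.h>  // mmap, munmap
    #include <sys/stat.h>  // fstat
    #include <unistd.h>    // close

    #define FILEPATH "myfile.bin"

    int main()
    {
        int fd = open(FILEPATH, O_RDONLY);
        if (fd == -1)
        {
            perror("Error opening file for reading");
            return 1;
        }

        // Query the size through the open descriptor so it matches the mapped file.
        struct stat s;
        if (fstat(fd, &s) == -1)
        {
            perror("Error getting the file size");
            close(fd);
            return 1;
        }

        void *addr = mmap(NULL, s.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping stays valid after the descriptor is closed

        if (addr == MAP_FAILED)
        {
            perror("Error mmapping the file");
            return 1;
        }

        // The file holds raw floats, so the mapping can be read as a float array.
        const float *map = (const float *)addr;
        size_t count = s.st_size / sizeof(float);

        for (size_t i = 0; i < count; i++)
        {
            float v = map[i];  // use v here; no text conversion is needed
        }

        if (munmap(addr, s.st_size) == -1)
        {
            perror("Error unmapping the file");
            return 1;
        }
        return 0;
    }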

    This dramatically reduces the time required to read the same number of values, because no text-to-float conversion takes place.
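
    The binary file has to be produced once from the existing text file. The answer does not show that step, so the following is only a sketch of one way to do it. The function name text_to_binary and the chunk size are assumptions made for illustration. The values are written as raw floats in the machine's native byte order, so the resulting file is only meant to be read back on a compatible machine.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Reads whitespace-separated floats from text_path and writes them to
// bin_path as raw native-endian floats. Returns the number of values
// written, or -1 on error. Hypothetical helper, not from the original post.
long long text_to_binary(const char *text_path, const char *bin_path)
{
    std::FILE *in = std::fopen(text_path, "r");
    if (!in)
        return -1;
    std::FILE *out = std::fopen(bin_path, "wb");
    if (!out)
    {
        std::fclose(in);
        return -1;
    }

    const std::size_t kChunk = 65536;  // values buffered per fwrite call
    std::vector<float> buffer;
    buffer.reserve(kChunk);

    long long total = 0;
    bool ok = true;
    float v;
    while (ok && std::fscanf(in, "%f", &v) == 1)
    {
        buffer.push_back(v);
        if (buffer.size() == kChunk)
        {
            ok = std::fwrite(buffer.data(), sizeof(float), buffer.size(), out) == buffer.size();
            total += static_cast<long long>(buffer.size());
            buffer.clear();
        }
    }
    // Flush whatever is left in the last, partially filled chunk.
    if (ok && !buffer.empty())
    {
        ok = std::fwrite(buffer.data(), sizeof(float), buffer.size(), out) == buffer.size();
        total += static_cast<long long>(buffer.size());
    }

    std::fclose(in);
    if (std::fclose(out) != 0)
        ok = false;
    return ok ? total : -1;
}
```

    Calling text_to_binary("testingfile.txt", "myfile.bin") once would produce the file that the mmap() code above reads. Assuming a 4-byte float, 100000000 values occupy 400000000 bytes in this format, compared with 1199999988 bytes for the text file.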