c performance file-io

Efficiently read flattened file in C


I'm trying to read a large file in C that has one float per line, using the code below. It works fine on small test data, but reading 600 million numbers this way is very slow. Any ideas for how I can speed it up? I generate the raw file with Python, so reformatting the data (for example, putting multiple comma-separated numbers on each line) is also an option. Any insight into why this method is so slow would be greatly appreciated.

void read_file(float *W)
{
   FILE *fp;

   int i = 0;

// In this file, one row should contain only one NUMBER!!
// So flatten the matrix.
   if (fp = fopen("C:\\Users\\rohit\\Documents\\GitHub\\base\\numerical\\c\\ReadFile1\\Debug\\data.txt", "r")) {
      while (fscanf(fp, "%f", &W[i]) != EOF) {
         ++i;
      }
      fclose(fp);
   }

   scanf("%d",&i);    
}

Solution

  • I encountered a similar problem years ago. The solution was to replace fscanf with fgets and strtod. This gave much more than a 10-fold improvement, if I recall correctly.

    So your loop:

      while (fscanf(fp, "%f", &W[i]) != EOF) {
         ++i;
      }
    

    should look something like:

      char buf[80]; // assumes every line fits in the buffer
      while (fgets(buf, sizeof buf, fp)) {
         W[i++] = strtod(buf, 0);
      }
    

    Edit: Error checking is always a good idea. So adding this in, the simple two-liner grows to about ten lines:

      char buf[80];
      errno = 0;
      while (!errno && fgets(buf, sizeof buf, fp)) {
          W[i++] = strtod(buf, 0);
      }
      if (errno) { // ERANGE from strtod (EINVAL only on some implementations), or a read error such as EINTR
          int save = errno; // fprintf may change errno, so keep a copy
          fprintf(stderr, "errno=%d reading line %d\n", save, i); // or perror()
          exit(1);
      }
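
    Standard C does not require strtod to set errno when a line contains no number at all, so checking the end pointer is a more portable way to catch that case. Below is a minimal sketch of the same fgets-based loop with that check added. The helper name read_floats, its parameters, and the choice to skip blank lines are assumptions for illustration, not part of the original code; it uses strtof because the destination array holds float.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one float per line from fp into out[0..cap-1].
 * Stops at end of file or once cap values are stored.
 * Returns the number of values stored, or -1 on a parse or read error,
 * in which case *bad_line holds the 1-based line number involved. */
long read_floats(FILE *fp, float *out, long cap, long *bad_line)
{
    char buf[80];               /* assumes every line fits in the buffer */
    long n = 0, line = 0;

    assert(fp != NULL && out != NULL && bad_line != NULL);
    *bad_line = 0;

    while (n < cap && fgets(buf, sizeof buf, fp)) {
        char *end;
        float v;

        ++line;
        errno = 0;
        v = strtof(buf, &end);  /* text after the number is ignored */

        if (end == buf) {
            /* Nothing was parsed: accept a blank line, reject anything else. */
            while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
                ++end;
            if (*end == '\0')
                continue;
            *bad_line = line;
            return -1;
        }
        if (errno == ERANGE) {  /* value out of range for float */
            *bad_line = line;
            return -1;
        }
        out[n++] = v;
    }

    if (ferror(fp)) {           /* fgets stopped because of a read error */
        *bad_line = line + 1;
        return -1;
    }
    return n;
}
```

    A caller would open the file as before, call read_floats(fp, W, capacity, &bad_line), and treat a return value of -1 as a failure at line bad_line. Note that text such as nan or inf still parses successfully here.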
    

    Edit 2: Regarding error checking, the input file could easily contain text such as nan or inf, perhaps from some upstream bug. Both strtod and fscanf parse these without complaint, which could cause mysterious problems later in your code.

    But it is easy enough to check. Add the code:

      int bad = 0; // needs <math.h>
      for (int j = 0; j < i; j++)
          bad += !isfinite(W[j]); // counts nan and inf; isnormal() would also flag 0.0 and subnormals
      if (bad) {
         // ... handle error
      }
    

    Putting this in a separate, simple loop makes it easier for the compiler to optimize (in theory), especially if you use something like #pragma GCC optimize ("unroll-loops").
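
    A minimal sketch of such a validation pass as a standalone function. It uses isfinite() so that zeros and subnormal values are not counted as bad (isnormal() is false for those too). The name count_non_finite and its signature are assumptions made for this example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: counts entries of w[0..n-1] that are NaN or infinite. */
size_t count_non_finite(const float *w, size_t n)
{
    size_t bad = 0;

    assert(w != NULL || n == 0);
    for (size_t j = 0; j < n; j++)
        bad += !isfinite(w[j]);  /* adds 1 for nan, +inf, and -inf only */
    return bad;
}
```

    After reading, a call such as count_non_finite(W, i) returns 0 when every value is finite, so any nonzero result can be handled as an input error.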