How do I do stdin.byLine, but with a buffer?


I'm reading multi-gigabyte files and processing them from stdin. I'm reading from stdin like this.

  import std.conv : to;
  import std.stdio : stdin;

  string line;
  foreach(line1; stdin.byLine){
    line = to!string(line1);  // copies the line out of byLine's reused buffer
    ...
  }

Is there a faster way to do this? I tried a threading approach with

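  import std.concurrency : receiveOnly, send, spawn, thisTid;

  auto childTid = spawn(&fn, thisTid);
  string line;
  foreach(line1; stdin.byLine){
    line = to!string(line1);
    receiveOnly!int();        // blocks until the worker thread sends an int
    send(childTid, line);
  }
  int x = 0;
  send(childTid, x);          // an int, not a string, is sent after the last line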

That lets it load at least one more line from disk while my process is running, at the cost of a copy operation, but it is still silly. What I need is fgets, or a way to combine stdin.byChunk(4096) with readline. I tried fgets:

  char[] buf = new char[4096];
  fgets(buf.ptr, 4096, stdio)

but it always fails with an error saying that stdio is a file and not a stream. I'm not sure how to make it a stream. Any help with whichever approach you think best would be appreciated. I'm not very good at D, so apologies for any noob mistakes.


Solution

  • There are actually already two layers of buffering under the hood (excluding the hardware itself): the C runtime library and the kernel both do a layer of buffering to minimize I/O costs.

    First, the kernel keeps data from disk in its own buffer and will read ahead, loading beyond what you request in a single call if you are following a predictable access pattern. This mitigates the low-level cost of seeking on the device, and the cache is shared across processes: if you read a file with one program and then again with a second, the second will probably get it from the kernel's memory cache instead of the physical disk and may be noticeably faster.

    Second, the C library, on which D's std.stdio is built, also keeps a buffer. readln ultimately calls C file I/O functions, which read a chunk at a time from the kernel. (Fun fact: writes are also buffered by the C library, by line by default if the output is interactive and by chunk otherwise. Writing is quite slow and doing it by chunk makes a big difference, but sometimes the C lib thinks a pipe isn't interactive when it is, which leads to a FAQ: "Simple D program Output order is wrong".)

    These C lib buffers also mitigate the costs of many small reads and writes by batching them up before even sending to the kernel. In the case of readln, it will likely read several kilobytes at once, even if you ask for just one line or one byte, and the rest stays in the buffer for next time.
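
Since std.stdio sits on top of the C library, the fgets attempt from the question can be made to work by handing fgets the C stream behind stdin rather than the module name. This is a minimal sketch, not a recommendation: File.getFP() returns the underlying FILE*, and the helper name readLineC and the 4096-byte buffer are just illustrative choices.

```d
import core.stdc.stdio : FILE, fgets;
import core.stdc.string : strlen;
import std.stdio : File, stdin, writeln;

/// Reads one line with C's fgets into `buf` and returns the filled slice
/// (newline included), or null at end of input.
/// Lines longer than buf.length - 1 bytes come back in several pieces.
char[] readLineC(File f, char[] buf)
{
    FILE* fp = f.getFP();   // the C stream that the D File wraps
    if (fgets(buf.ptr, cast(int) buf.length, fp) is null)
        return null;
    return buf[0 .. strlen(buf.ptr)];
}

void main()
{
    auto buf = new char[4096];
    size_t pieces;
    while (true)
    {
        auto line = readLineC(stdin, buf);
        if (line is null)
            break;
        ++pieces;           // process `line` here; it is overwritten by the next call
    }
    writeln(pieces);
}
```

This goes through the same C library buffer that readln and byLine use, so it is unlikely to change the I/O cost by itself; what it avoids is the per-line copy.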

    So your readln loop is already going to be automatically buffered and should get decent I/O performance.
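
If you stay with byLine, two small tweaks may help before reaching for anything lower level. The sketch below works directly on the char[] that byLine hands out (it reuses one internal buffer, so to!string is only needed when a line must outlive the iteration) and asks the C library for a larger stream buffer with File.setvbuf. The 1 MiB size, the 80-byte limit, and the function name countLongLines are arbitrary assumptions for illustration.

```d
import std.stdio : File, stdin, writeln;

/// Counts the lines longer than `limit` bytes without copying any of them.
size_t countLongLines(File input, size_t limit)
{
    size_t n;
    foreach (line; input.byLine)    // `line` is a view into a reused buffer
    {
        if (line.length > limit)
            ++n;
        // Copy only when a line has to be kept: auto keep = line.idup;
    }
    return n;
}

void main()
{
    // Must be called before the first read; the default mode is full buffering.
    stdin.setvbuf(1 << 20);
    writeln(countLongLines(stdin, 80));
}
```

Whether the bigger buffer makes a measurable difference depends on the platform, so it is worth timing both variants on your real input.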

    You might be able to do better yourself with a few techniques, though. You could try std.mmfile to memory-map the file and read it as if it were an array, but files this large will not fit in a 32-bit address space. It might work on 64 bit, though. (Note that a memory-mapped file is NOT loaded all at once; it is just mapped to a memory address. When you actually touch part of it, the operating system will load/save on demand.)
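
A memory-mapped version could look roughly like the sketch below. It assumes the input is available as a regular file with a path (a pipe on stdin cannot be mapped) and a 64-bit build; the function name countNonEmptyLines is made up for the example, and the per-line work is just a placeholder.

```d
import std.mmfile : MmFile;
import std.stdio : writeln;
import std.string : lineSplitter;

/// Maps `path` read-only and walks it line by line without copying.
size_t countNonEmptyLines(string path)
{
    auto mmf = new MmFile(path);        // default mode is read-only
    scope (exit) destroy(mmf);          // unmap when we are done

    // The whole file appears as one array; pages are loaded on first touch.
    auto text = cast(const(char)[]) mmf[];

    size_t n;
    foreach (line; text.lineSplitter)   // each `line` is a slice of the mapping
    {
        if (line.length)
            ++n;
    }
    return n;
}

void main(string[] args)
{
    if (args.length > 1)
        writeln(countNonEmptyLines(args[1]));
}
```

The slices are only valid while the mapping exists, so anything that has to survive past the function needs an explicit copy.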

    Or, of course, you can use the lower-level operating system functions, like read and write from import core.sys.posix.unistd or ReadFile and WriteFile from import core.sys.windows.windows, which bypass the C lib's layer (but, of course, keep the kernel layers, which you want; don't try to bypass them).

    You can look for any win32 or posix system call C tutorials if you want to know more about using those functions. It is the same in D as in C, with minor caveats like the import instead of #include.
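
For the POSIX side, a bare-bones read loop might look like this. It is a sketch under a few assumptions: POSIX only, a hypothetical helper name readChunks, an arbitrary 64 KiB buffer, and no special handling for interrupted calls (EINTR).

```d
import core.sys.posix.unistd : read, STDIN_FILENO;
import std.exception : errnoEnforce;
import std.stdio : writeln;

/// Calls `sink` with each chunk read from file descriptor `fd` until end of input.
/// The slice passed to `sink` is only valid until the next read.
void readChunks(int fd, ubyte[] buf, scope void delegate(const(ubyte)[]) sink)
{
    for (;;)
    {
        immutable n = read(fd, buf.ptr, buf.length);
        errnoEnforce(n >= 0, "read failed");
        if (n == 0)
            break;                      // end of input
        sink(buf[0 .. cast(size_t) n]);
    }
}

void main()
{
    auto buf = new ubyte[64 * 1024];
    size_t total;
    readChunks(STDIN_FILENO, buf, (const(ubyte)[] chunk) { total += chunk.length; });
    writeln(total);
}
```

Do not mix this with std.stdio reads on the same descriptor: the C library may already have pulled data into its own buffer, and the raw read would never see it.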

    Once you load a chunk, you will most likely want to scan it for newlines and slice out each line to form the range you pass to the loop or other algorithms. The std.range and std.algorithm modules also have searching, splitting, and chunking functions that might help, but you need to be careful with lines that span the edges of your buffers, both for correctness and for efficiency.
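
One way to handle lines that straddle a buffer edge is to carry the unfinished tail of each chunk over to the next one. The sketch below does that on top of File.byChunk, which is roughly the "byChunk plus line reading" combination the question asks about. The name eachLineChunked and the 64 KiB chunk size are assumptions, only '\n' is treated as a terminator, and it has not been benchmarked against byLine.

```d
import std.stdio : File, stdin, writeln;
import std.string : indexOf;

/// Reads `f` in chunks of `chunkSize` bytes and calls `onLine` once per line
/// (terminator stripped). The slice is only valid during the call.
void eachLineChunked(File f, size_t chunkSize,
                     scope void delegate(const(char)[]) onLine)
{
    char[] carry;                       // unfinished line from earlier chunks
    foreach (ubyte[] chunk; f.byChunk(chunkSize))
    {
        auto data = cast(const(char)[]) chunk;
        for (;;)
        {
            immutable nl = data.indexOf('\n');
            if (nl < 0)
                break;
            if (carry.length)
            {
                carry ~= data[0 .. nl]; // finish the line that spanned the edge
                onLine(carry);
                carry = carry[0 .. 0];
                carry.assumeSafeAppend(); // reuse the allocation next time
            }
            else
                onLine(data[0 .. nl]);  // common case: no copy at all
            data = data[nl + 1 .. $];
        }
        carry ~= data;                  // keep the tail; byChunk reuses its buffer
    }
    if (carry.length)
        onLine(carry);                  // last line without a trailing newline
}

void main()
{
    size_t count;
    eachLineChunked(stdin, 64 * 1024, (const(char)[] line) { ++count; });
    writeln(count);
}
```

Only lines that cross a chunk boundary get copied, so with chunks much larger than a typical line the copying stays small.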

    But if your performance is good enough as it is, I'd say just leave it - the buffering in the C lib and the kernel does a pretty good job in most cases.