I have problems with the write performance of fseek()/fwrite() on my Mac. I'm operating on large files of up to 4 GB; the tests below were made with a rather small one of only 120 MB. My strategy is as follows:
- fopen() a new file on disk
- fill the file to its final size with zeroed dummy data
- fseek() to the target position and fwrite() a fragment of 4 KB or less, repeated until the file is complete

The whole procedure takes around 120 seconds.
The write strategy is bound to an image rotation algorithm (see my question here), and unless someone comes up with a faster solution for the rotation problem, I'm not able to change the strategy of using fseek() and then writing 4 KB or less to the file.
What I am observing is this: the first few thousand fseek()/fwrite() calls perform quite well, but the performance drops very fast, faster than you would expect from any system cache filling up. The chart below shows fwrite()s per second vs. time in seconds. As you can see, after 7 seconds the fseek()/fwrite() rate reaches approx. 200 per second and keeps dropping until it reaches 100 per second at the very end of the process.
In the middle of the process (2 or 3 times), the OS decides to flush file contents to disk, which I can see from my console output hanging for a few seconds; during that time I get approx. 5 MB/s of write activity on my disk (which isn't much). After fclose() the system seems to write out the whole file: I see 20 MB/s of disk activity for a longer period of time.
If I use fflush() every 5,000 fwrite()s, the behaviour doesn't change at all. Putting in an fclose()/fopen() pair to force flushing speeds the whole thing up by approx. 10%.
I profiled the process (screenshot below), and as you can see, virtually all time is spent inside fwrite() and fseek(), both of which drill down to __write_nocancel().
Completely absurd summary
Imagine the case where my input data fits completely into my buffers, so I can write my rotated output data linearly without splitting the write process into fragments. I still call fseek() to position the file pointer, just because the logic of the writing function works that way, but in this case the file pointer is set to the position it is already at. One would expect no performance impact. Wrong.
The absurd part: if I remove the calls to fseek() for that special case, my function finishes within 2.7 seconds instead of 120 seconds.
Now, after a long foreword, the question is: why does fseek() have such an impact on performance, even when I seek to the position the file pointer is already at? How could I speed it up (by another strategy, other function calls, disabling caching if possible, memory-mapped access, ...)?
For reference, here's my code (not tidied up, not optimized, containing lots of debug output):
-(bool)writeRotatedRaw:(TIFF*)tiff toFile:(NSString*)strFile
{
    if(!tiff) return NO;
    if(!strFile) return NO;

    NSLog(@"Starting to rotate '%@'...", strFile);
    FILE *f = fopen([strFile UTF8String], "w");
    if(!f)
    {
        NSString *msg = [NSString stringWithFormat:@"Could not open '%@' for writing.", strFile];
        NSRunAlertPanel(@"Error", msg, @"OK", nil, nil);
        return NO;
    }

#define LINE_CACHE_SIZE (1024*1024*256)

    int h = [tiff iImageHeight];
    int w = [tiff iImageWidth];
    int iWordSize = [tiff iBitsPerSample]/8;
    int iBitsPerPixel = [tiff iBitsPerSample];
    int iLineSize = w*iWordSize;
    int iLinesInCache = LINE_CACHE_SIZE / iLineSize;
    int iLinesToGo = h, iLinesToRead;

    // Pre-allocate the target file at its final size with zeroed lines.
    NSLog(@"Creating temporary file");
    double time = CACurrentMediaTime();
    double lastTime = time;
    unsigned char *dummy = calloc(iLineSize, 1);
    for(int i=0; i<h; i++) fwrite(dummy, 1, iLineSize, f);
    free(dummy);
    fclose(f);

    // Re-open for update: "r+" keeps the pre-allocated contents,
    // whereas "w" would truncate the file we just created.
    f = fopen([strFile UTF8String], "r+");
    NSLog(@"Created temporary file (%.1f MB) in %.1f seconds", (float)iLineSize*(float)h/1024.0f/1024.0f, CACurrentMediaTime()-time);
    fseek(f, 0, SEEK_SET);

    lastTime = CACurrentMediaTime();
    time = CACurrentMediaTime();

    int y=0;
    unsigned char *ucRotatedPixels = malloc(iLinesInCache*iWordSize);
    unsigned short int *uRotatedPixels = (unsigned short int*)ucRotatedPixels;
    unsigned char *ucLineCache = malloc(w*iWordSize*iLinesInCache);
    unsigned short int *uLineCache = (unsigned short int*)ucLineCache;
    unsigned char *uc;
    unsigned int uSizeCounter=0, uMaxSize = iLineSize*h, numfwrites=0, lastwrites=0;

    while(iLinesToGo>0)
    {
        iLinesToRead = iLinesToGo;
        if(iLinesToRead>iLinesInCache) iLinesToRead = iLinesInCache;

        // Read as many lines as fit into the buffer.
        for(int i=0; i<iLinesToRead; i++)
        {
            uc = [tiff getRawLine:y+i withBitsPerPixel:iBitsPerPixel];
            memcpy(ucLineCache+i*iLineSize, uc, iLineSize);
        }

        // For each column, gather the rotated fragment and write it out.
        for(int x=0; x<w; x++)
        {
            if(iBitsPerPixel==8)
            {
                for(int i=0; i<iLinesToRead; i++)
                {
                    ucRotatedPixels[iLinesToRead-i-1] = ucLineCache[i*w+x];
                }
                fseek(f, w*x+(h-y-1), SEEK_SET);
                fwrite(ucRotatedPixels, 1, iLinesToRead, f);
                numfwrites++;
                uSizeCounter += iLinesToRead;
            }
            else
            {
                for(int i=0; i<iLinesToRead; i++)
                {
                    uRotatedPixels[iLinesToRead-i-1] = uLineCache[i*w+x];
                }
                fseek(f, (w*x+(h-y-1))*2, SEEK_SET);
                fwrite(uRotatedPixels, 2, iLinesToRead, f);
                numfwrites++;
                uSizeCounter += iLinesToRead*2;
            }

            // Progress output, at most once per second.
            if(CACurrentMediaTime()-lastTime>1.0)
            {
                lastTime = CACurrentMediaTime();
                NSLog(@"Progress: %.1f %%, x=%d, y=%d, iLinesToRead=%d\t%d", (float)uSizeCounter * 100.0f / (float)uMaxSize, x, y, iLinesToRead, numfwrites);
            }
        }
        y += iLinesInCache;
        iLinesToGo -= iLinesToRead;
    }

    free(ucLineCache);
    free(ucRotatedPixels);
    fclose(f);
    NSLog(@"Finished, %.1f s", (CACurrentMediaTime()-time));
    return YES;
}
I'm a bit lost because I do not understand how the system "optimizes" my calls. Any input is appreciated.
To bring this question to a close, I'll answer it myself and share my solution.
Although I wasn't able to improve the performance of the fseek() calls themselves, I implemented a well-performing workaround. The aim was to avoid fseek() at any cost. Because I need to write fragments of data to different positions of the target file, but those fragments are equally spaced and the gaps between them are filled by other fragments written somewhat later in the process, I split the writing across multiple files: I write to as many temporary files as there are fragment streams and then, in a last step, re-open all those files, read them in rotation, and write their data blocks linearly to the target file. This performs well, reaching approx. 4 seconds for the example given above.