Search code examples
cperformanceglibcscanf

Why is glibc's sscanf vastly slower than fscanf on Linux?


I am using GCC 4.8 and glibc 2.19 on an x86_64 Linux.

While playing with different input methods for a different question, I compared fscanf and sscanf. Specifically, I would either use fscanf on the standard input directly:

char s[128]; int n;

while (fscanf(stdin, "%127s %d", s, &n) == 2) { }

Or I would first read the entire input into a buffer and then traverse the buffer with sscanf. (Reading everything into the buffer takes a tiny amount of time.)

char s[128]; int n;
char const * p = my_data;

for (int b; sscanf(p, "%127s %d%n", s, &n, &b) == 2; p += b) { }

To my surprise, the fscanf version is vastly faster. For example, processing seveal tens of thousands of lines with fscanf takes this long:

10000       0.003927487 seconds time elapsed
20000       0.006860206 seconds time elapsed
30000       0.007933329 seconds time elapsed
40000       0.012881912 seconds time elapsed
50000       0.013516816 seconds time elapsed
60000       0.015670432 seconds time elapsed
70000       0.017393129 seconds time elapsed
80000       0.019837480 seconds time elapsed
90000       0.023925753 seconds time elapsed

Now the same with sscanf:

10000       0.035864643 seconds time elapsed
20000       0.127150772 seconds time elapsed
30000       0.319828373 seconds time elapsed
40000       0.611551668 seconds time elapsed
50000       0.919187459 seconds time elapsed
60000       1.327831544 seconds time elapsed
70000       1.809843039 seconds time elapsed
80000       2.354809588 seconds time elapsed
90000       2.970678416 seconds time elapsed

I was using the Google perf tools to measure this. For example, for 50000 lines, the fscanf code requires about 50M cycles, and the sscanf code about 3300M cycles. So I broke down the top call sites with perf record/perf report. With fscanf:

 35.26%  xf  libc-2.19.so         [.] _IO_vfscanf
 23.91%  xf  [kernel.kallsyms]    [k] 0xffffffff8104f45a
  8.93%  xf  libc-2.19.so         [.] _int_malloc

And with sscanf:

 98.22%  xs  libc-2.19.so         [.] rawmemchr
  0.68%  xs  libc-2.19.so         [.] _IO_vfscanf
  0.38%  xs  [kernel.kallsyms]    [k] 0xffffffff8104f45a

So almost all of the time with sscanf is spent in rawmemchr! Why is this? How can the fscanf code avoid this cost?

I tried searching for this, but the best I could come up with is this discussion of locked realloc calls which I don't think applies here. I was also thinking that fscanf has better memory locality (using the same buffer over and over), but that can't make such a big difference.

Does anyone have any insights in this strange discrepancy?


Solution

  • sscanf() converts the string you pass in to an _IO_FILE* to make the string look like a "file". This is so the same internal _IO_vfscanf() can be used for both a string and a FILE*.

    However, as part of that conversion, done in a _IO_str_init_static_internal() function, it calls __rawmemchr (ptr, '\0'); essentially a strlen() call, on your input string. This conversion is done on every call to sscanf(), and since your input buffer is rather large, it'll spend a fair amount of time calculating the length of the input string.

    Creating a FILE* from the input string using fmemopen() and use fscanf() could be another alternative.