Tags: c, profiling, gprof

Mystery function _L_lock_154 when profiling with gprof


When profiling my code, gprof is outputting the following:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 30.69     10.55    10.55                             _L_lock_154
 16.58     16.25     5.70 1126059616     0.00     0.00  double_cmp
 14.25     21.15     4.90    13737     0.36     0.36  bsdi
 10.01     24.59     3.44                             memcpy

(showing only the first few lines)

Taking the top spot is a function I do not recognise, and unfortunately neither does Google.

What could it be? Does anyone know? As it's taking up the lion's share of my time, I would be keen to find out.

Using Mike's suggestion of DIY profiling, I see this stack backtrace quite often:

Program received signal SIGINT, Interrupt.
0xb7fff424 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7fff424 in __kernel_vsyscall ()
#1  0x080ada62 in __write_nocancel ()
#2  0x080859c1 in _IO_new_file_write ()
#3  0x08084b64 in new_do_write ()
#4  0x080861ed in _IO_new_do_write ()
#5  0x080869c5 in _IO_new_file_overflow ()
#6  0x08085f3e in _IO_new_file_xsputn ()
#7  0x080c4d54 in vfprintf ()
#8  0x080b071c in __fprintf_chk ()
#9  0x0805ec36 in fprint_track_hum ()
#10 0x0805efe1 in fprint_hash_tracks ()
#11 0x08049c33 in main (argc=22, argv=0xbfffeac4) at harness.c:537

Although I don't see the call to _L_lock_154 in the backtrace, I'm starting to think it might be something stdio-related: a lock taken when writing to a FILE pointer, or some such. So I'll try disabling all printing and re-profiling, to see if the mystery function disappears.
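(As a reference point: glibc's stdio does take the stream's internal lock around each fprintf/fputc call, and the _L_lock_* names are low-level lock stubs inside glibc. Below is a minimal sketch, assuming the output goes through stdio as the backtrace suggests, of how that locking can be hoisted out of a hot loop; write_block and its arguments are made up for illustration, not taken from my harness. flockfile, funlockfile and putc_unlocked are standard POSIX calls.)

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Illustrative sketch only: stdio normally takes the FILE's lock on
   every call. Taking it once with flockfile() and using the unlocked
   variant inside the loop removes the per-character locking. */
static void write_block(FILE *fp, const char *buf, size_t n)
{
    flockfile(fp);                   /* take the stream lock once */
    for (size_t i = 0; i < n; i++)
        putc_unlocked(buf[i], fp);   /* no per-call locking       */
    funlockfile(fp);                 /* release the stream lock   */
}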

UPDATE #1

No, it's still a bit of a mystery. I disabled all output to file/stdout, but the sinister _L_lock_154 is still taking up 10% of my time. It could be something to do with reading from file, but I can't run my harness without some input.

I'm really surprised Google draws a blank on this; that's very rare.

UPDATE #2

I just ran gprof in call-graph rather than flat mode. It spits out this for _L_lock_154:

granularity: each sample hit covers 4 byte(s) for 0.03% of 36.86 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     49.6   11.82    6.45                 _L_lock_154 [1]
                5.84    0.00 1198163721/1198163721     double_cmp [8]
                0.52    0.00 125628587/125628587     fptype_cmp [24]
                0.04    0.00 13096233/13096233     grp_cmp_by_density_descending [52]
                0.04    0.00 3464916/3464916     pdw_ptr_cmp_by_amp [53]
                0.01    0.00   73812/73812       pdw_ptr_cmp_by_idx [89]
                0.00    0.00   71682/9288620     int_cmp [44]
                0.00    0.00  636314/842100      pri_ptr_cmp_by_second_pdw_idx [123]
                0.00    0.00  277089/407783      pri_ptr_cmp_by_first_pdw_idx [124]
                0.00    0.00   76178/76178       window_cmp_by_emitter_id [128]
                0.00    0.00   10147/10147       pri_ptr_cmp_by_first_pdw_toa [137]
-----------------------------------------------

I'm sure this is trying to tell me something, but I'm not sure how to interpret it. All my cmp-type functions are for use with qsort from stdlib. If _L_lock_154 is the parent of all these calls, does that imply it is an alias for qsort?
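(For context: in gprof's call-graph output, <spontaneous> means gprof could not determine the function's callers. The cmp functions listed under it are ordinary qsort comparators; a hypothetical reconstruction of one, since the real double_cmp isn't shown here, would look like this:)

#include <stdlib.h>

/* Hypothetical reconstruction of a qsort() comparator like double_cmp;
   the real one is not shown in this post. */
static int double_cmp(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);   /* -1, 0 or +1, with no overflow risk */
}

/* typical call site: qsort(values, n, sizeof(double), double_cmp); */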


Solution

  • OK: gprof samples on CPU time, while random pausing samples on wall-clock time. So when gprof says a routine takes 30% of the time, it means 30% of CPU time, which may well be less than 1% of wall-clock time. In other words, you are I/O-bound. So even if you could make that routine take zero time, you would never see the difference. (That's why I disparage "CPU profiling": it leads less-experienced programmers to focus their attention on things that don't matter.)

    In case this isn't clear: your program may do actual computing for 1 microsecond, of which a large part consists of going into and coming out of the I/O stack of routines. Once it reaches the bottom of the I/O stack, it may wait for 99 microseconds before coming back out and using the CPU for another 1 microsecond. gprof only tells you how that 1% of the time is used, as if that were all that mattered. (The sketch after this answer makes the CPU-time versus wall-clock distinction concrete.)

    To get results comparable to gprof from random pausing, you would have to discard any samples that contain __kernel_vsyscall. So you might have to take 100 samples before you found one that was actually using CPU time. That's a lot of work, but if you did it, I would expect you to see the lock routine on about 30% of the non-I/O samples.

    Remember, gprof was written by people working at UC Berkeley, an academic environment where a version of Unix was being developed. It carried a built-in assumption: that all I/O is necessary I/O. In real user-level software, that is very often not the case. Besides, I/O is not non-CPU time; it's just a different CPU's time. There's an I/O controller chip running that disk, port, or what-have-you.
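To make the CPU-time versus wall-clock distinction concrete, here is a minimal self-contained sketch (added for illustration, not part of the answer above; the scratch-file path is arbitrary). It times an I/O-heavy loop two ways: clock() counts CPU time only, while CLOCK_MONOTONIC counts wall-clock time, including any time spent blocked in write(). On a run where the writes actually hit a device, the two figures diverge, and that gap is exactly what a CPU profiler like gprof never shows.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    clock_t c0 = clock();                     /* CPU time, start   */
    clock_gettime(CLOCK_MONOTONIC, &t0);      /* wall clock, start */

    FILE *fp = fopen("/tmp/demo.out", "w");   /* arbitrary scratch file */
    if (!fp)
        return 1;
    for (int i = 0; i < 1000000; i++)
        fprintf(fp, "line %d\n", i);
    fclose(fp);   /* flushes buffered data; may block on the device */

    clock_t c1 = clock();                     /* CPU time, end   */
    clock_gettime(CLOCK_MONOTONIC, &t1);      /* wall clock, end */

    printf("CPU:  %.3f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("wall: %.3f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}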