Analyzing readdir() performance


It's bothering me that Linux takes so long to list all the files in huge directories, so I created a little test program that recursively walks a directory and counts all the files in it:

#include <stdio.h>
#include <string.h>
#include <dirent.h>

int list(char *path) {
    int i = 0;

    DIR *dir = opendir(path);
    struct dirent *entry;
    char new_path[1024];

    if (dir == NULL)
        return 0;

    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_type == DT_DIR) {
            /* skip ".", ".." and hidden directories */
            if (entry->d_name[0] == '.')
                continue;

            strcpy(new_path, path);
            strcat(new_path, "/");
            strcat(new_path, entry->d_name);
            i += list(new_path);
        }
        else
            i++;
    }
    closedir(dir);

    return i;
}

int main() {
    char *path = "/home";
    printf("%i\n", list(path));
    return 0;
}
When compiling it with gcc -O3, the program runs for about 15 sec (I ran the program a few times and the runtime is approximately constant, so the fs cache should not play a role here):

$ /usr/bin/time -f "%CC %DD %EE %FF %II %KK %MM %OO %PP %RR %SS %UU %WW %XX %ZZ %cc %ee %kk %pp %rr %ss %tt %ww %xx" ./a.out
./a.outC 0D 0:14.39E 0F 0I 0K 548M 0O 2%P 178R 0.30S 0.01U 0W 0X 4096Z 7c 14.39e 0k 0p 0r 0s 0t 1692w 0x

So it spends about S=0.3 sec in kernel space and U=0.01 sec in user space, with 7+1692 context switches. A context switch takes about 2000 nsec, so the switches account for roughly 2000 nsec * (7+1692) = 3.398 msec [1]. That still leaves more than 10 seconds unaccounted for, and I would like to find out what the program is doing during that time. Are there any other tools to investigate what the program is doing all the time? gprof just gives me the time for the (userspace) call graph, and gcov does not list the time spent in each line, only how often a line is executed...

[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html


Solution

  • oprofile is a decent sampling profiler which can profile both user and kernel-mode code.
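    For reference, a minimal oprofile run with its operf front-end (assuming your oprofile release ships operf; ./a.out is simply the binary from the question) would look roughly like this:

    operf ./a.out
    opreport --symbols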

    According to your numbers, however, roughly 14 of those seconds are spent asleep, which oprofile will not register well. What may be more useful is ftrace combined with a reading of the kernel code. ftrace provides trace points in the kernel which can log a message and a stack trace when they fire; the event most useful for determining why your process is sleeping is the sched_switch event.

    I would recommend that you enable kernel-mode stack traces and the sched_switch event, set a buffer large enough to capture the entire lifetime of your process, then run your process and stop tracing immediately afterwards. By reviewing the trace, you will be able to see every time your process went to sleep, whether it was runnable or non-runnable at that point, a high-resolution timestamp, and a call stack indicating what put it to sleep.

    ftrace is controlled through debugfs. On my system, debugfs is mounted at /sys/kernel/debug, but yours may be different.
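    If debugfs is not mounted at all, a standard mount (as root) at the conventional mountpoint should take care of it:

    mount -t debugfs none /sys/kernel/debug

    Here is an example of what I would do to capture this information: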

    # Enable stack traces
    echo "1" > /sys/kernel/debug/tracing/options/stacktrace
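    # Enlarge the per-CPU trace buffer so the whole run fits
    # (assumed extra step; 32768 KB is just an example value)
    echo "32768" > /sys/kernel/debug/tracing/buffer_size_kb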
    # Enable the sched_switch event
    echo "1" > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
    # Make sure tracing is enabled
    echo "1" > /sys/kernel/debug/tracing/tracing_on
    # Run the program and disable tracing as quickly as possible
    ./your_program; echo "0" > /sys/kernel/debug/tracing/tracing_on
    # Examine the trace
    vi /sys/kernel/debug/tracing/trace
    

    The resulting output will have lines which look like this:

    # tracer: nop
    #
    # entries-in-buffer/entries-written: 22248/3703779   #P:1
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
              <idle>-0     [000] d..3  2113.437500: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:0 next_pid=878 next_prio=120
              <idle>-0     [000] d..3  2113.437531: <stack trace>
     => __schedule
     => schedule
     => schedule_preempt_disabled
     => cpu_startup_entry
     => rest_init
     => start_kernel
         kworker/0:0-878   [000] d..3  2113.437836: sched_switch: prev_comm=kworker/0:0 prev_pid=878 prev_prio=120 prev_state=S ==> next_comm=your_program next_pid=898 next_prio=120
         kworker/0:0-878   [000] d..3  2113.437866: <stack trace>
     => __schedule
     => schedule
     => worker_thread
     => kthread
     => ret_from_fork
    

    The lines you care about are those where your program appears as the prev_comm task, meaning the scheduler is switching away from your program to run something else. prev_state indicates whether your program was still runnable (R) or blocked (S, U or some other letter; see the ftrace source). If it was blocked, you can examine the stack trace and the kernel source to figure out why.
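    For example, a quick way to pull those events (and the stack traces that follow them) out of the trace file is something like the line below, reusing the your_program name from the sample output above; the -A 8 context count is only a guess at how many lines each stack trace takes:

    grep -A 8 "prev_comm=your_program" /sys/kernel/debug/tracing/trace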