
In the linux_dirent64 structs written in the Linux syscall getdents64, why is d_off not the sum of the d_reclens of all earlier entries?


According to the man page of getdents:

d_off is the distance from the start of the directory to the start of the next linux_dirent. d_reclen is the size of this entire linux_dirent.

So I would expect that if the first entry has d_reclen n, its d_off would also be n (and for the i-th entry, d_off would be the sum of the d_reclens of all entries from 0 to i, inclusive).

However, in that same man page, a nicely printed table with the entries of an example directory looks like this:

       --------------- nread=120 ---------------
       inode#    file type  d_reclen  d_off   d_name
              2  directory    16         12  .
              2  directory    16         24  ..
             11  directory    24         44  lost+found
             12  regular      16         56  a
         228929  directory    16         68  sub
          16353  directory    16         80  sub2
         130817  directory    16       4096  sub3

The d_off fields of the entries do not follow the rule I expected. If the first entry has size 16, surely the offset from the start to the second entry would be 16, but it is actually 12.

So what don't I understand about the d_off field of linux_dirent64?


Solution

  • The getdents manual page is vague about this, but as compiling and running its example program shows, your assumption does not hold.

    The manual page for readdir(3) gives a bit more insight:

    d_off  The value returned in d_off is the same as would be returned by
           calling telldir(3) at the current position in the directory
           stream.  Be aware that despite its type and name, the d_off field
           is seldom any kind of directory offset on modern filesystems.
           Applications should treat this field as an opaque value, making no
           assumptions about its contents; see also telldir(3).
    

    The key part is "the d_off field is seldom any kind of directory offset on modern filesystems". The d_off field is a value for internal use by the underlying filesystem, and its meaning is implementation-specific. It does not necessarily have any correlation with d_reclen, nor does it need to represent an actual byte offset within the directory. To step from one record to the next inside the buffer filled by getdents64, add d_reclen to the current position; d_off plays no part in that. Whatever software you write, do not rely on the value of d_off, and treat it as an opaque value.

    There may be filesystems where d_off corresponds to an actual offset in bytes between dirents, but this is in general not the case. The field is used more or less like a "counter" or "cookie" value that marks a position in the directory listing, so that reading can resume after a given entry.

    In fact, on a Btrfs filesystem, d_off appears to start at 1 for . and 2 for .., increase by one for each following dirent, and equal INT32_MAX for the last one. This holds at least for a directory of freshly created files; the values change after deleting, moving, or creating more files.

    $ mkdir test
    $ cd test
    $ touch a b c d e f
    $ ls -l
    total 0
    -rw-r----- 1 marco marco 0 gen 15 01:20 a
    -rw-r----- 1 marco marco 0 gen 15 01:20 b
    -rw-r----- 1 marco marco 0 gen 15 01:20 c
    -rw-r----- 1 marco marco 0 gen 15 01:20 d
    -rw-r----- 1 marco marco 0 gen 15 01:20 e
    -rw-r----- 1 marco marco 0 gen 15 01:20 f
    
    $ ../test_program
    --------------- nread=192 ---------------
    inode#    file type  d_reclen  d_off   d_name
    46206659  directory    24          1  .
      214242  directory    24          2  ..
    46206662  regular      24          3  a
    46206663  regular      24          4  b
    46206664  regular      24          5  c
    46206665  regular      24          6  d
    46206666  regular      24          7  e
    46206667  regular      24 2147483647  f
    

    This 2004 Sourceware bug report for Glibc by Dan Tsafrir also contains some insightful explanations about d_off, such as:

    • In the implementation of getdents(), the d_off field (belonging to the linux kernel's dirent structure) is falsely assumed to contain the byte offset to the next dirent. Note that the linux manual of the readdir system-call states that d_off is the "offset to this dirent" while glibc's getdents treats it as the offset to the next dirent.

    • In practice, both of the above are wrong/misleading. The d_off field may contain illegal negative values, 0 (should also never happen as the "next" dirent's offset must always be bigger then 0), or positive values that are bigger than the size of the directory-file itself:

      • We're not sure what the Linux kernel intended to place in this field, but our experience shows that on "real" file systems (that actually reside on some disk) the offset seems to be a simple (not necessarily continuous) counter: e.g. first entry may have d_off=1, second: d_off=2, third: d_off=4096, fourth=d_off=4097 etc. We conjecture this is the serial of the dirent record within the directory (and so, this is indeed the "offset", but counted in records out of which some were already removed).

      • For file systems that are maintained by the amd automounter (automount, directories) the d_off seems to be arbitrary (and may be negative, zero or beyond the scope of a 32bit integer). We conjecture the amd doesn't assign this field and the received values are simply garbage.