I'm learning about in-kernel data transfer between two file descriptors in Linux and came across something I cannot understand. Here is a quote from the `copy_file_range` man page:
> `copy_file_range()` gives filesystems an opportunity to implement "copy acceleration" techniques, such as the use of reflinks (i.e., two or more i-nodes that share pointers to the same copy-on-write disk blocks) or server-side-copy
I used to think of index nodes as something that is returned by the `stat`/`statx` syscall. The `st_ino` type is `typedef`ed here as

```c
typedef unsigned long __kernel_ulong_t;
```
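For context, this is roughly how I read that number with `statx()` (a minimal sketch; the file name is just a placeholder):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct statx stx;

    /* "example.txt" is a placeholder path */
    if (statx(AT_FDCWD, "example.txt", 0, STATX_INO, &stx) == -1) {
        perror("statx");
        return 1;
    }
    printf("inode number: %llu\n", (unsigned long long)stx.stx_ino);
    return 0;
}
```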
So what does "two or more i-nodes that share pointers to the same copy-on-write disk blocks" even mean?
As I understand it, the fact that `copy_file_range` does not need to pass the data through user space means the kernel doesn't have to load the data from the disk at all (it still might, but it doesn't have to), and this allows further optimization by pushing the operation down the file-system stack. This covers the case of server-side copy over NFS.
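As an illustration, here is roughly what a user-space copy via `copy_file_range()` looks like (a hedged sketch; the file names and the minimal error handling are just placeholders). Note that no user-space buffer appears anywhere, which is what leaves the kernel free to optimize:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "src" and "dst" are placeholder file names */
    int in  = open("src", O_RDONLY);
    int out = open("dst", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in == -1 || out == -1) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(in, &st) == -1) {
        perror("fstat");
        return 1;
    }

    /* No user-space buffer: the kernel moves (or reflinks) the data itself. */
    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n == -1) {
            perror("copy_file_range");
            return 1;
        }
        if (n == 0)
            break;  /* source ended earlier than expected */
        remaining -= n;
    }

    close(in);
    close(out);
    return 0;
}
```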
The actual answer about the other optimizations starts with an intro into how files are stored; you may skip it if you already know that.
There are 3 layers in how files are stored in a typical Linux FS (see the sketch after this list):
1. The file entry in some directory (which is itself a file containing a list of such entries). Such an entry essentially maps a file name to an inode. It does so by storing the inode number (aka `st_ino`), which is effectively a pointer to the inode in some table.
2. The inode, which contains some shared (see further) metadata (such as what `stat` returns) and some pointer(s) to the data block(s) that store the actual file contents.
3. The actual data blocks.
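To make the three layers concrete, here is a deliberately simplified toy model in C. These are not the real on-disk or kernel structures, just an illustration of who points at what:

```c
#include <stdint.h>

/* Toy model only -- real filesystems use their own on-disk formats. */

struct data_block {
    char bytes[4096];               /* the actual file contents live here */
};

struct inode {
    uint64_t ino;                   /* the inode number reported as st_ino */
    uint32_t nlink;                 /* how many directory entries point here */
    uint64_t size;                  /* metadata like the one returned by stat */
    struct data_block *blocks[12];  /* pointers to the data blocks */
};

struct dir_entry {
    char     name[256];             /* file name within the directory */
    uint64_t ino;                   /* maps the name to an inode in some table */
};
```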
So, for example, a hard link is a record in some directory that points to the same inode as the "original" file (and increments the "link counter" inside the inode). This means that only the file names (and possibly the directories) are different; all the rest of the data and metadata is shared between the hard links. Note that creating a hard link is a very fast way to copy a file. The only drawback is that both files are now bound to share their contents forever, so this is not a true copy. But if we used some copy-on-write method to fix the "write" part, it would work very nicely. This is what some FSes (such as Btrfs) support via reflinks.
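A quick way to see the sharing: create a hard link with `link(2)` and compare `st_ino` and `st_nlink` for the two names (a minimal sketch; the file names are placeholders and "original" must already exist):

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "original" and "hardlink" are placeholder names */
    if (link("original", "hardlink") == -1) {
        perror("link");
        return 1;
    }

    struct stat a, b;
    if (stat("original", &a) == -1 || stat("hardlink", &b) == -1) {
        perror("stat");
        return 1;
    }

    /* Both names resolve to the same inode; the link counter went up. */
    printf("same inode: %s\n", a.st_ino == b.st_ino ? "yes" : "no");
    printf("link count: %lu\n", (unsigned long)a.st_nlink);
    return 0;
}
```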
The idea of this copy-on-write trick is that you can create a new inode with new, appropriate metadata but still share the same data blocks. You also add cross-references between the two inodes in the "invisible" part of the inode metadata so they know they share the data blocks. Obviously this operation is very fast compared to real copying. And again, as long as the files are only read, everything works perfectly. But unlike with hard links, we can treat writes as independent as well. When a write is performed, the FS checks whether the file (or rather the inode) is really the only owner of the data blocks and, if not, copies the data before writing to it. Depending on the FS implementation, it can copy the whole file on the first write, or it can store some more detailed metadata and only copy the blocks that have to be modified, still sharing the rest between the files. In the latter case, blocks might not need to be copied at all if the write covers them entirely (e.g., when the write size is more than a block).
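On filesystems that support reflinks (e.g. Btrfs or XFS), user space can request exactly this sharing with the `FICLONE` ioctl (which is also what `cp --reflink` uses). A hedged sketch with placeholder file names:

```c
#include <fcntl.h>
#include <linux/fs.h>    /* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>

int main(void)
{
    /* "src" and "clone" are placeholder names; the FS must support reflinks. */
    int src = open("src", O_RDONLY);
    int dst = open("clone", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src == -1 || dst == -1) {
        perror("open");
        return 1;
    }

    /* Ask the FS to share src's data blocks with dst, copy-on-write style. */
    if (ioctl(dst, FICLONE, src) == -1) {
        perror("ioctl(FICLONE)");   /* fails on FSes without reflink support */
        return 1;
    }
    return 0;
}
```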
So the simplest trick `copy_file_range()` can do is to check whether the whole file is actually being copied and, if so, perform the reflink trick described above (obviously only if the FS supports it).
Some more advanced optimizations are also possible if the FS supports more detailed metadata on data blocks. Assume you copy the first N bytes from the start of the file into a new file. Then the FS can just share the starting blocks and probably only has to copy the last one, which is not copied in full.
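User space can also ask for this kind of partial sharing explicitly via the `FICLONERANGE` ioctl, where the FS supports it. A minimal sketch, again with placeholder file names and an arbitrary example length of 1 MiB standing in for N:

```c
#include <fcntl.h>
#include <linux/fs.h>    /* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

int main(void)
{
    /* "src" and "partial" are placeholder names */
    int src = open("src", O_RDONLY);
    int dst = open("partial", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src == -1 || dst == -1) {
        perror("open");
        return 1;
    }

    /* Share the first 1 MiB of src with dst; ranges generally have to be
       block-aligned (except when the range ends at EOF). */
    struct file_clone_range range;
    memset(&range, 0, sizeof(range));
    range.src_fd      = src;
    range.src_offset  = 0;
    range.src_length  = 1024 * 1024;
    range.dest_offset = 0;

    if (ioctl(dst, FICLONERANGE, &range) == -1) {
        perror("ioctl(FICLONERANGE)");
        return 1;
    }
    return 0;
}
```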