c, linux-kernel, filesystems, vfs

How does the virtual filesystem handle syscalls like read and write?


(All of the code snippets are taken from: https://docs.huihoo.com/doxygen/linux/kernel/3.7/dir_97b3d2b63ac216821c2d7a22ee0ab2b0.html)

Hi! To give some context: I have been studying the Linux fs code for almost a month for research, and I am stuck. I am looking at include/linux/fs.h (which, if I am not wrong, defines almost all the major structures and pointers used by code such as read_write.c and open.c), and I see this snippet:

struct file_operations {
    struct module *owner;
    loff_t (*llseek) (struct file *, loff_t, int);
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
    int (*readdir) (struct file *, void *, filldir_t);
    unsigned int (*poll) (struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
    long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
    int (*release) (struct inode *, struct file *);
    int (*fsync) (struct file *, loff_t, loff_t, int datasync);
    int (*aio_fsync) (struct kiocb *, int datasync);
    int (*fasync) (int, struct file *, int);
    int (*lock) (struct file *, int, struct file_lock *);
    ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock) (struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **);
    long (*fallocate)(struct file *file, int mode, loff_t offset,
                      loff_t len);
};

Here, as you can see, they have declared function pointers for these very specific syscalls, which are defined in their respective files. For example, read_write.c defines the read and write syscalls as SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) and SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count) respectively. For research purposes, I went inside these two definitions and hunted down every function call made inside them (at least those linked in the Doxygen documentation), and the function calls inside those function calls, but I could not answer a very simple question: how do these two syscalls call into the virtual filesystem, which in turn calls the drivers required to read actual blocks of data from the filesystem? (If it is filesystem-specific, please show me the locations in the code where it is handed off to the FS drivers.)

P.S. I did the same hunt for the open syscall and was able to find the place where it invokes part of the namei.c code to perform that task, specifically: struct file *do_filp_open(int dfd, struct filename *pathname, const struct open_flags *op, int flags). Here they use the structure nameidata, which holds the relevant information from the inode needed to open a file.


Solution

  • In-Kernel Filesystems in Linux

    In Linux, in-kernel filesystems are implemented in a modular fashion. For example, each struct inode contains a pointer to a struct file_operations, the same struct you copied in your question. This struct contains function pointers for various file operations.

    For example, the member ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); is a pointer to a function that takes a struct file *, char __user *, size_t, and loff_t * as parameters, and returns a ssize_t.
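
To make the function-pointer idea concrete, here is a minimal userspace sketch, not kernel code. The names demo_file, demo_file_operations, mem_read, and mem_fops are hypothetical and exist only for illustration; the __user annotation is dropped because it only has meaning inside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t, off_t */

struct demo_file;

/* Hypothetical, cut-down analogue of struct file_operations. */
struct demo_file_operations {
    ssize_t (*read)(struct demo_file *, char *, size_t, off_t *);
};

/* Hypothetical analogue of struct file: an ops table plus backing data. */
struct demo_file {
    const struct demo_file_operations *f_op;
    const char *data;
    size_t size;
};

/* One "filesystem's" read implementation: copy out of a memory buffer. */
static ssize_t mem_read(struct demo_file *f, char *buf, size_t count, off_t *pos)
{
    assert(f != NULL && buf != NULL && pos != NULL);
    if (*pos < 0 || (size_t)*pos >= f->size)
        return 0;                       /* at or past end of file */
    size_t left = f->size - (size_t)*pos;
    if (count > left)
        count = left;                   /* short read near the end */
    memcpy(buf, f->data + *pos, count);
    *pos += (off_t)count;
    return (ssize_t)count;
}

/* The ops table that a file backed by this "filesystem" would point to. */
const struct demo_file_operations mem_fops = { .read = mem_read };
```

A caller that only holds a struct demo_file * can then invoke f->f_op->read(f, buf, count, &pos) without knowing which implementation sits behind the pointer, which is the property the VFS relies on.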

    Routing syscalls to the underlying filesystem

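    When the read system call occurs, the kernel VFS code looks up the struct file for the given file descriptor, and then calls the filesystem's read function that is specified in that file's struct file_operations. Here's a trace of the read system call in recent kernels (ksys_read() does not exist in the 3.7 tree linked in the question, where the syscall handler calls vfs_read() directly):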

    1. the read() syscall handler is invoked,
    2. which calls ksys_read(),
    3. which calls vfs_read().
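
The three steps above can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the kernel's implementation: every demo_* name, the fixed-size descriptor table, and the in-memory mem_read backend are hypothetical, and locking, reference counting, and permission checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_MAX_FDS 4

struct demo_file;
struct demo_file_operations {
    ssize_t (*read)(struct demo_file *, char *, size_t, off_t *);
};
struct demo_file {
    const struct demo_file_operations *f_op;
    off_t f_pos;                 /* current file offset */
    const char *data;
    size_t size;
};

/* Hypothetical per-process descriptor table: fd -> open file. */
struct demo_file *demo_fd_table[DEMO_MAX_FDS];

/* Step 3: dispatch through the file's ops table. */
ssize_t demo_vfs_read(struct demo_file *file, char *buf, size_t count, off_t *pos)
{
    assert(file != NULL);
    if (!file->f_op || !file->f_op->read)
        return -EINVAL;
    return file->f_op->read(file, buf, count, pos);
}

/* Step 2: resolve the descriptor, read, and store the new offset. */
ssize_t demo_ksys_read(unsigned int fd, char *buf, size_t count)
{
    if (fd >= DEMO_MAX_FDS || !demo_fd_table[fd])
        return -EBADF;
    struct demo_file *file = demo_fd_table[fd];
    off_t pos = file->f_pos;
    ssize_t ret = demo_vfs_read(file, buf, count, &pos);
    if (ret >= 0)
        file->f_pos = pos;
    return ret;
}

/* Step 1: the syscall entry point simply forwards its arguments. */
ssize_t demo_sys_read(unsigned int fd, char *buf, size_t count)
{
    return demo_ksys_read(fd, buf, count);
}

/* A stand-in "filesystem" whose files live in memory. */
static ssize_t mem_read(struct demo_file *f, char *buf, size_t count, off_t *pos)
{
    if (*pos < 0 || (size_t)*pos >= f->size)
        return 0;
    size_t left = f->size - (size_t)*pos;
    if (count > left)
        count = left;
    memcpy(buf, f->data + *pos, count);
    *pos += (off_t)count;
    return (ssize_t)count;
}

const struct demo_file_operations mem_fops = { .read = mem_read };
```

Installing a struct demo_file in demo_fd_table and calling demo_sys_read() with its index walks the same path: descriptor lookup, then the generic read helper, then the backend's own read function.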

    This is where the magic happens in vfs_read():

    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        ret = new_sync_read(file, buf, count, pos);
    else
        ret = -EINVAL;
    

    A related struct, struct file, also contains a pointer to a struct file_operations (f_op), which is set from the inode's i_fop when the file is opened. The if-condition above checks whether this file has a read() handler, and calls it if so. Otherwise, it checks for a read_iter handler and, if one exists, calls it through new_sync_read(). If neither exists, it returns -EINVAL.
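
The fallback order described above can be modelled in userspace like this. It is a sketch under assumptions: demo_kiocb and demo_iter are hypothetical, heavily simplified stand-ins for the kernel's struct kiocb and struct iov_iter, and iter_only_fops imitates a filesystem that, like ext4 in this answer, provides read_iter but no read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct demo_file;
struct demo_kiocb { struct demo_file *ki_filp; off_t ki_pos; };
struct demo_iter  { char *buf; size_t count; };

struct demo_file_operations {
    ssize_t (*read)(struct demo_file *, char *, size_t, off_t *);
    ssize_t (*read_iter)(struct demo_kiocb *, struct demo_iter *);
};
struct demo_file {
    const struct demo_file_operations *f_op;
    const char *data;
    size_t size;
};

/* Wrap the plain (buf, count, pos) arguments and call read_iter. */
static ssize_t demo_new_sync_read(struct demo_file *file, char *buf,
                                  size_t count, off_t *pos)
{
    struct demo_kiocb kiocb = { .ki_filp = file, .ki_pos = *pos };
    struct demo_iter iter = { .buf = buf, .count = count };
    ssize_t ret = file->f_op->read_iter(&kiocb, &iter);
    *pos = kiocb.ki_pos;
    return ret;
}

/* Same three-way branch as the vfs_read() excerpt. */
ssize_t demo_vfs_read(struct demo_file *file, char *buf, size_t count, off_t *pos)
{
    assert(file != NULL && file->f_op != NULL);
    if (file->f_op->read)
        return file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        return demo_new_sync_read(file, buf, count, pos);
    return -EINVAL;
}

/* A backend that implements only read_iter. */
static ssize_t iter_only_read_iter(struct demo_kiocb *iocb, struct demo_iter *to)
{
    struct demo_file *f = iocb->ki_filp;
    if (iocb->ki_pos < 0 || (size_t)iocb->ki_pos >= f->size)
        return 0;
    size_t n = f->size - (size_t)iocb->ki_pos;
    if (n > to->count)
        n = to->count;
    memcpy(to->buf, f->data + iocb->ki_pos, n);
    iocb->ki_pos += (off_t)n;
    return (ssize_t)n;
}

const struct demo_file_operations iter_only_fops = { .read_iter = iter_only_read_iter };
const struct demo_file_operations empty_fops = { .read = NULL };
```

A file using iter_only_fops takes the second branch, while a file using empty_fops falls through to -EINVAL.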

    Example: ext4

    In ext4, the struct file_operations for regular files is ext4_file_operations, defined in fs/ext4/file.c. It is used in several places, and it is associated with an inode through the inode's i_fop field. ext4 defines a read_iter handler (i.e. ext4_file_read_iter), but not a read handler. So, when read(2) is called on an ext4 file, ext4_file_read_iter() is eventually called.

    At this point, we've gotten to filesystem specific code. How ext4 manages blocks can be explored further from here.