
Why can't Linux write more than 2147479552 bytes?


In man 2 write, the NOTES section contains the following note:

On Linux, write() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

  1. Why is that?
  2. The DESCRIPTION section has the following sentence:

According to POSIX.1, if count is greater than SSIZE_MAX, the result is implementation-defined

SSIZE_MAX is way bigger than 0x7ffff000. Why is this note there?

Update: Thanks for the answer! In case anyone is interested (and for better SEO to help developers out here), all functions with that limitation are:

  • read
  • write
  • sendfile

To find this out, one just has to full-text search the man pages:

 % man -wK "0x7ffff000"
/usr/share/man/man2/write.2.gz
/usr/share/man/man2/read.2.gz
/usr/share/man/man2/sendfile.2.gz
/usr/share/man/man2/sendfile.2.gz

Solution

  • Why is this here?

    I don't think there's necessarily a good reason for this - I think this is basically a historical artifact. Let me explain with some git archeology.

    In current Linux, this limit is governed by MAX_RW_COUNT:

    ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
    {
        [...]
        if (count > MAX_RW_COUNT)
            count =  MAX_RW_COUNT;
    

    That constant is defined as the bitwise AND of INT_MAX and the page mask, which works out to roughly INT_MAX minus the size of one page.

    #define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
    

    So that's where 0x7ffff000 comes from - your platform has pages which are 4096 bytes wide, which is 2^12, so it's the max integer value with the bottom 12 bits unset.
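
    As a quick illustration (a userspace sketch, not kernel code, and it assumes 4096-byte pages), you can reproduce the value yourself:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* Mirror the kernel's definition, assuming PAGE_SIZE is 4096 (2^12). */
        long page_size = 4096;
        long page_mask = ~(page_size - 1);         /* clears the bottom 12 bits */
        long max_rw_count = INT_MAX & page_mask;

        printf("MAX_RW_COUNT = %#lx (%ld bytes)\n", max_rw_count, max_rw_count);
        /* Prints: MAX_RW_COUNT = 0x7ffff000 (2147479552 bytes) */
        return 0;
    }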

    The last commit to change this, ignoring commits which just move things around, was e28cc71572da3.

    Author: Linus Torvalds <[email protected]>
    Date:   Wed Jan 4 16:20:40 2006 -0800
    
        Relax the rw_verify_area() error checking.
        
        In particular, allow over-large read- or write-requests to be downgraded
        to a more reasonable range, rather than considering them outright errors.
        
        We want to protect lower layers from (the sadly all too common) overflow
        conditions, but prefer to do so by chopping the requests up, rather than
        just refusing them outright.
    

    So, this gives us a reason for the change: to prevent integer overflow, the size of the write is capped just below INT_MAX. Most of the surrounding logic seems to have been changed to use long or size_t, but the check remains.
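
    To make the overflow concern concrete, here is a minimal userspace sketch (purely illustrative, not kernel code): a byte count just past INT_MAX no longer fits in a plain int, so any lower layer still using int for sizes would see it as a negative number.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        size_t count = (size_t)INT_MAX + 1;   /* a request one byte past INT_MAX   */
        int as_int = (int)count;              /* what an int-based layer would see */

        printf("count = %zu, as int = %d\n", count, as_int);
        /* On typical systems this prints a negative value for as_int,
           which is exactly the kind of surprise the cap is meant to avoid. */
        return 0;
    }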

    Before this change, giving it a buffer larger than INT_MAX would result in an EINVAL error:

    if (unlikely(count > INT_MAX))
            goto Einval;
    

    Before this, instead of a hardcoded limit, each file had a max IO size, set by the filesystem.

    if (unlikely(count > file->f_maxcount))
            goto Einval;
    

    The addition of the per-filesystem limit is described in this email.

    However, no filesystem ever changed the max count from INT_MAX, so this feature was never used before it was removed less than a year later. I cannot find any discussion of why this feature was added.

  • Is this POSIX compliant?

    Putting on my standards-lawyer hat, I think this is actually POSIX compliant. Yes, POSIX does say that writes larger than SSIZE_MAX are implementation-defined behavior, but this cap is well below SSIZE_MAX, so that clause doesn't settle the question. However, there are two other sentences in the standard which I think are important:

    The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.
    [...]
    Upon successful completion, write() and pwrite() shall return the number of bytes actually written to the file associated with fildes. This number shall never be greater than nbyte. Otherwise, -1 shall be returned and errno set to indicate the error.

    A partial write is explicitly allowed by the standard. For this reason, all code which calls write() needs to wrap it in a loop that retries short writes.
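
    A common shape for that loop is sketched below (the name write_all is mine, not something from POSIX); it keeps calling write() until the whole buffer has been written or a real error occurs:

    #include <errno.h>
    #include <unistd.h>

    /* Returns 0 on success, -1 on error (errno is left set by write()). */
    static int write_all(int fd, const void *buf, size_t count)
    {
        const char *p = buf;

        while (count > 0) {
            ssize_t n = write(fd, p, count);
            if (n < 0) {
                if (errno == EINTR)
                    continue;          /* interrupted by a signal: retry */
                return -1;             /* genuine error                  */
            }
            p += n;                    /* skip past what was written     */
            count -= (size_t)n;        /* short write: loop for the rest */
        }
        return 0;
    }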

  • Should the limit be raised?

    Ignoring the historical baggage, and the standard, is there a reason to raise this limit today?

    I'd argue the answer is no. The optimal size of the write() buffer is a tradeoff between trying to avoid excessive context switches between kernel and userspace, and ensuring your data fits into cache as much as possible.

    The coreutils programs (which provide cat, cp, etc.) use a buffer size of 128 KiB. The optimal size for your hardware might be slightly larger or smaller. But it's unlikely that 2 GB buffers are going to be faster.
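
    As a rough sketch of what that looks like in practice (illustrative only, not coreutils source; the 128 KiB figure just mirrors the number above), a copy loop might read into a fixed buffer and retry short writes:

    #include <errno.h>
    #include <unistd.h>

    #define COPY_BUF_SIZE (128 * 1024)     /* 128 KiB, in the coreutils ballpark */

    /* Copy everything from in_fd to out_fd. Returns 0 on success, -1 on error. */
    static int copy_fd(int in_fd, int out_fd)
    {
        static char buf[COPY_BUF_SIZE];

        for (;;) {
            ssize_t nread = read(in_fd, buf, sizeof buf);
            if (nread == 0)
                return 0;                  /* end of file */
            if (nread < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }

            char *p = buf;
            while (nread > 0) {            /* retry short writes */
                ssize_t nwritten = write(out_fd, p, (size_t)nread);
                if (nwritten < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;
                }
                p += nwritten;
                nread -= nwritten;
            }
        }
    }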