Search code examples
gnu-sort

gnu sort - default buffer size


I have read the full documentation for gnu sort and searched online but I cannot find what the default for the --buffer-size option is (which determines how much system memory the program uses when it runs). I am guessing it is somehow determined based on total system memory? (or perhaps on memory available at the time the program is begins execution?). How can I determine this?

update: I've experimented a bit and it seems that when I don't specify a particular --buffer-size value, it ends up using very little ram and thus going very slowly. It would be nice though to better understand what exactly is determining this behavior.


Solution

  • I went digging through the coreutils sort source code and found these functions: default_sort_size and sort_buffer_size.

    It turns out that --buffer-size (sort_size in the source code) isn't the target buffer size but rather the maximum buffer size. If no --buffer-size value is specified, the default_sort_size function is used to determine a safe maximum buffer size. It does this based on resource limits, available memory, and total memory. A summary of the function is as follows:

    size = MIN(SIZE_MAX, resource_limit) / 2;
    mem  = MAX(available_memory, total_memory / 8);
    
    if ( size > total_memory * 0.75 )
        size = total * 0.75;
    
    buffer_max = MIN(mem, size);
    buffer_max = MAX(buffer, MIN_SORT_SIZE);
    

    The other function, sort_buffer_size, is used to determine exactly how much memory to allocate for the given input files. A summary of the function is as follows:

    if (sort_size is set)
        size_bound = sort_size;
    else
        size_bound = default_sort_size();
    
    buffer_size = line_bytes + 2;
    
    for each input_file
        if (input_file is regular)
            file_size = input_file_size;
        else
            if (sort_size is set)
                return sort_size;
            else
                file_size = guess;
    
        worst_case = file_size * worst_case_per_input_byte + 1;
    
        if (worst_case overflows || size + worst_case >= size_bound)
            return size_bound;
        else
            size += worst_case;
    
    return size;
    

    Possibly the most important point of the sort_buffer_size function is that if you're sorting data from STDIN or a pipe, it will automatically default to sort_size (i.e. --buffer-size) if it was provided. Otherwise, for regular files it will make some rough calculations based on the file sizes and only use sort_size as an upper limit.