Tags: c++, algorithm, sorting, file-descriptor

Checking the performance of disk-based merge sort with auxiliary storage


I am trying to implement a disk-based merge sort with auxiliary storage. The implementation is shown below.

fd - file descriptor for the dataset to be sorted

fd2 - file descriptor for auxiliary storage

#include <string.h>
#include <unistd.h>

#define LENGTH 100        /* size of one record, in bytes */
#define SEARCH_BEGIN 4    /* byte offset of the first record in the data file */

/* Recursively sorts records [s, e] of fd, using fd2 as scratch space. */
int merge_sort_d(int fd, int fd2, int s, int e) {
    int i, m;
    int l, r;
    char lv[LENGTH], rv[LENGTH];
    char buf[LENGTH];
    if (s >= e) return 1;
    m = (s + e) / 2;
    merge_sort_d(fd, fd2, s, m);
    merge_sort_d(fd, fd2, m+1, e);
    l = s;
    r = m+1;
    memset(lv, 0, LENGTH);
    memset(rv, 0, LENGTH);
    lseek(fd2, 0, SEEK_SET);
    /* Merge the two sorted halves record-by-record into fd2. */
    while (l <= m && r <= e) {
        lseek(fd, 1LL*SEARCH_BEGIN + 1LL*l*LENGTH, SEEK_SET);
        read(fd, (void *)lv, LENGTH);
        lseek(fd, 1LL*SEARCH_BEGIN + 1LL*r*LENGTH, SEEK_SET);
        read(fd, (void *)rv, LENGTH);
        if (strncmp(lv, rv, LENGTH) < 0) {
            write(fd2, (void *)lv, LENGTH);
            ++l;
        } else {
            write(fd2, (void *)rv, LENGTH);
            ++r;
        }
    }
    /* Copy any records remaining in the left half. */
    for (; l <= m; ++l) {
        lseek(fd, 1LL*SEARCH_BEGIN + 1LL*l*LENGTH, SEEK_SET);
        read(fd, (void *)lv, LENGTH);
        write(fd2, (void *)lv, LENGTH);
    }
    /* Copy any records remaining in the right half. */
    for (; r <= e; ++r) {
        lseek(fd, 1LL*SEARCH_BEGIN + 1LL*r*LENGTH, SEEK_SET);
        read(fd, (void *)rv, LENGTH);
        write(fd2, (void *)rv, LENGTH);
    }
    /* Copy the merged result from fd2 back into place in fd. */
    lseek(fd, 1LL*SEARCH_BEGIN + 1LL*s*LENGTH, SEEK_SET);
    lseek(fd2, 0, SEEK_SET);
    memset(buf, 0, LENGTH);
    for (i=s; i<=e; ++i) {
        read(fd2, (void *)buf, LENGTH);
        write(fd, (void *)buf, LENGTH);
    }
    return 1;
}

After implementing the disk-based merge sort, I tested it on some small cases to check that it runs correctly. It looks fast enough on small inputs, but when running it on a large dataset of over 20 GB (the final size will be over 500 GB), it takes 2 hours, and I am not sure it really runs in O(n log n). Of course a disk-based algorithm and data structure add some extra time.

I am curious whether it really runs in O(n log n).


Solution

  • The standard in-memory merge sort does log(n) passes over the data, merging successively larger lists with each pass. In the first pass, you merge lists that contain one item each. Next, it's lists with two items each, then four, etc. With this method, you make log(n) passes over the data, and you look at n items during each pass. Thus the O(n log n) complexity.

    That method is very effective for in-memory sorts, because the cost of accessing an item is not very high. For disk-based sorts, though, it becomes very expensive because the cost of accessing each item is very high. Basically, you're reading and writing the entire file log(n) times. If you have 20 GB of 100-byte records, you're talking 25 or more passes over the file. So your sorting time will be dominated by the time it takes to read and write the file 25 times.
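
    To put rough numbers on that claim, here is a small back-of-the-envelope sketch in C++. The 20 GB file size and 100-byte records come from the question; the pass count is simply ceil(log2(n)) for a merge sort that rewrites the whole file once per level of merging.

    #include <cmath>
    #include <cstdio>

    int main() {
        // Figures assumed from the question: 20 GB of 100-byte records.
        const double file_bytes   = 20.0 * 1024 * 1024 * 1024;
        const double record_bytes = 100.0;
        const double n = file_bytes / record_bytes;            // ~215 million records

        // A merge sort that rewrites the whole file once per merge level.
        const double passes      = std::ceil(std::log2(n));    // ~28 levels
        const double bytes_moved = 2.0 * passes * file_bytes;  // one read + one write per pass

        std::printf("records: %.0f, passes: %.0f, bytes moved: ~%.1f TB\n",
                    n, passes,
                    bytes_moved / (1024.0 * 1024 * 1024 * 1024));
        return 0;
    }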

    An external sort is a much different animal. The idea is to reduce disk I/O as much as possible. You do it in two passes. In the first pass, you read as much of the data as you can into memory and sort it using Quicksort, merge sort, or another in-memory sorting algorithm, and write that chunk to disk. Then you read the next chunk of the file into memory, sort it, and write to disk.
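
    As a rough sketch of that first pass, the fragment below reads fixed-size records into memory, sorts each chunk with std::sort, and writes each sorted chunk out as its own run file. The 100-byte record layout matches the question; the run-file naming and the chunk size are assumptions made purely for illustration.

    #include <algorithm>
    #include <string>
    #include <vector>
    #include <fcntl.h>
    #include <unistd.h>

    #define LENGTH 100  // record size, matching the question's layout

    // Pass 1 (sketch): read as many LENGTH-byte records as fit in memory,
    // sort them, and write each sorted chunk ("run") to its own temporary
    // file. 'fd' is assumed to be positioned at the first record, and
    // 'records_per_chunk' is whatever fits in available RAM.
    std::vector<std::string> make_runs(int fd, size_t records_per_chunk) {
        std::vector<std::string> run_names;
        std::vector<std::string> chunk;
        chunk.reserve(records_per_chunk);
        char rec[LENGTH];

        for (;;) {
            ssize_t got = read(fd, rec, LENGTH);
            if (got == LENGTH)
                chunk.emplace_back(rec, LENGTH);
            if (got != LENGTH || chunk.size() == records_per_chunk) {
                if (!chunk.empty()) {
                    std::sort(chunk.begin(), chunk.end());  // in-memory sort of one chunk
                    std::string name = "run" + std::to_string(run_names.size());
                    int out = open(name.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
                    for (const std::string &r : chunk)
                        write(out, r.data(), LENGTH);
                    close(out);
                    run_names.push_back(name);
                    chunk.clear();
                }
                if (got != LENGTH) break;  // end of input
            }
        }
        return run_names;
    }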

    When you're done with the first pass, you have some number of sorted chunks out on disk. If you have a 20 GB file and you have 4 GB of memory free on your machine, then you'll have five chunks, each about 4 GB in size. (Note that you'd probably actually have five chunks that are slightly smaller than 4 GB, and a sixth, very small, chunk). Call the number of chunks k.

    Note that after the first pass is done, you've read and written each record one time.

    In the second pass, you merge the multiple chunks. This is done with a heap of k items. The idea is that you initialize the heap with the first item from each chunk. You select the smallest of those k items (which is at the top of the heap), and write it to the output file. Then, you take the next item from the chunk that contained the item you just removed, and add it to the heap. You repeat this procedure until you've emptied all the chunks.
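
    Here is a minimal sketch of that k-way merge, using a std::priority_queue as the min-heap. The run-file handling is hypothetical and simply reuses the 100-byte record layout from the question.

    #include <queue>
    #include <string>
    #include <vector>
    #include <fcntl.h>
    #include <unistd.h>

    #define LENGTH 100  // record size, matching the question's layout

    // Pass 2 (sketch): merge k sorted run files into 'out_fd' with a min-heap.
    void merge_runs(const std::vector<std::string> &run_names, int out_fd) {
        struct Entry {
            std::string rec;  // current record from this run
            int fd;           // descriptor of the run it came from
            bool operator>(const Entry &o) const { return rec > o.rec; }
        };
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

        char buf[LENGTH];
        // Initialize the heap with the first record of each run.
        for (const std::string &name : run_names) {
            int fd = open(name.c_str(), O_RDONLY);
            if (read(fd, buf, LENGTH) == LENGTH)
                heap.push({std::string(buf, LENGTH), fd});
            else
                close(fd);
        }
        // Repeatedly emit the smallest record, then refill from the run it came from.
        while (!heap.empty()) {
            Entry e = heap.top();
            heap.pop();
            write(out_fd, e.rec.data(), LENGTH);
            if (read(e.fd, buf, LENGTH) == LENGTH)
                heap.push({std::string(buf, LENGTH), e.fd});
            else
                close(e.fd);
        }
    }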

    The first pass is O(n log n). More precisely, it's k*((n/k) log(n/k)) = n log(n/k), which is still O(n log n). The second pass is O(n log k).

    The important part here is that in the second pass, you once again read and wrote each item one time. You've reduced your disk I/O from reading and writing each item log(n) times to reading and writing each item twice. That will run much faster than the code you wrote.

    It's also important to note that both algorithms are indeed considered O(n log n). The I/O constant is the killer. The second algorithm actually does more computation, but it saves so much disk I/O time that it's faster than the theoretically faster algorithm.

    The External sorting article on Wikipedia gives a decent explanation, and the GeeksforGeeks article gives a working example in C++.