Search code examples
stxxlexternal-sorting

stxxl sorting of very large file (ubuntu)


I am trying to sort a large file with about billion records (each containing four integers). The size of the file would shoot up beyond 50GB.

I am testing my code with 400 million records (about 6 GB file). My disk configuration looks like this:

disk=/var/tmp/stxxl,50G,syscall delete

My machine has 16 GB RAM and has 8 physical processors (Intel i7), stxxl version 1.4.1. If I run the code with 200 million records, it takes about 5 minutes. But when I run the code with 400 million records, it seems to be running out of disk space. My questions are:

1) Why is my code running out of disk space for sorting even a 6 GB file? Kindly review it (only few important lines attached).

2) Is 5 minute a reasonable time for my PC to sort 200 million records? If its true, I wonder if stxxl can sort 5 billion records within daytime.

3) Do you think stxxl is a nice choice for this sort of problem? I have access to a cluster as well which has mpi installed.

CODE (inspired by examples/algo/sort_file.cpp and examples/algo/phonebills.cpp):

size_t memory_to_use = (1*1024) * 1024 * 1024ul;
typedef stxxl::vector<my_type, 1, stxxl::lru_pager<8>, block_size> vector_type;

std::copy(std::istream_iterator<my_type>(in),
  std::istream_iterator<my_type>(),
  std::back_inserter(v));

stxxl::sort(v.begin(), v.end(), Cmp(), memory_to_use);

Each vector element or record is a tuple of four unsigned numbers:

struct my_type
{
  typedef unsigned short key_type;
  typedef std::tuple<key_type, key_type, key_type, key_type> key4tuple;
  ...
}

Solution

  • If you only want to sort, consider using the stxxl::sorter.

    It should require only the expected amount of disk space, the total size of your data, and should sorting with at least ~100 MB/s, depending on your disk(s) and how complicated comparisons are relative to the data type size.

    The stxxl::sort() function does more work and needs extra space, since it writes temporary extra data.

    Also see my tutorial video :).