Suppose I have a very huge text file, with lines of arbitrary, differing lengths. I want to remove duplicate lines; how do I do this in C++?
Equal duplicates can be very far apart from each other, and I want to keep only the first occurrence.
The file is so huge that it can be 10-50 times bigger than the size of RAM.
There is the Linux command uniq, but it removes only adjacent equal lines, while I need to remove duplicates that can be arbitrarily far apart.
Originally this question was asked here, but now it is deleted, and attempts to undelete or reopen it failed. I had already implemented half of a huge answer to it by the time it was deleted.
I'm asking this question only to share my own answer below; still, here is a very tiny solution, which doesn't scale because it uses only an in-memory unordered set.
The simplest in-memory-only solution, using std::unordered_set:
#include <random>
#include <iostream>
#include <unordered_set>
#include <string>
#include <fstream>

int main() {
    size_t constexpr n = 15;
    std::mt19937_64 rng{125};
    {
        // Generate a small test file with duplicate lines.
        std::ofstream f("test.txt");
        std::cout << "Input:" << std::endl;
        for (size_t i = 0; i < n; ++i) {
            auto const x = rng() % (n * 3 / 4);
            f << x << std::endl;
            std::cout << x << std::endl;
        }
    }
    std::ofstream fw("test.txt.out");
    std::ifstream f("test.txt");
    std::string line;
    std::unordered_set<std::string> set;
    std::cout << std::endl << "Output:" << std::endl;
    // Copy each line to the output only if it was not seen before.
    while (std::getline(f, line)) {
        if (set.count(line))
            continue;
        fw << line << std::endl;
        std::cout << line << std::endl;
        set.insert(line);
    }
}
Console Output:
Input:
2
10
6
10
7
6
3
2
6
2
3
7
8
1
10
Output:
2
10
6
7
3
8
1
Note. Although in the code above I used only very short lines, in my real files lines can be of any length. In other words, the uniquized set of all lines exceeds the memory size many times over. This means in-memory-only solutions are not suitable.
Below I implemented from scratch 3 totally different solutions to the OP's task:
My own implementation of a disk-based hash set. It stores an 80-96-bit hash of each line and solves the task similarly to the code in the OP's question. The set is similar to absl::flat_hash_set in the design of its algorithm. This algorithm is the 2nd fastest of the three.
A sorting-based algorithm, which sorts the hashes of all lines in portions that fit into free memory using std::sort(), then merges them all with an N-way merge sort. This algorithm is the fastest of the three.
A disk-based B-Tree implemented from scratch, which supports any arbitrary branching degree. The tree keeps the sorted hashes of all lines, which allows duplicates to be excluded. This is the slowest of the three algorithms.
I'll provide some details about all the algorithms:
The disk-based hash set uses a single huge file that stores entries consisting of a value and a partial hash. The partial hash stored in an entry consists of the high bits of the line's hash; the lower bits of the hash are stored indirectly, as the index of the bucket.
This hash set is similar to absl::flat_hash_set from the Abseil library, in the sense that it stores part of the higher bits of the hash next to the value inside the bucket. When the hash set searches for an existing value, it first reads a bucket from disk, where the bucket index is equal to hash % hash_table_size.
After the bucket is read from disk, we check whether the hash of the searched key has the same higher bits. If so, the value is checked to see whether its key is equal to the searched one. If not, the next few buckets are read from disk (in fact they are cached from the previous read) and checked the same way. If we hit an empty bucket, we conclude that the searched key is not present.
To add a value to the hash set, we search for the key as described above, then write the key/value into the first empty bucket.
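To illustrate the lookup and insertion scheme, here is a minimal in-memory sketch of the probing logic (the real hash set keeps the buckets in a huge file and reads/writes them in cached chunks; the class name, field sizes and the empty-bucket marker below are assumptions of this illustration, not the actual implementation):

#include <cstdint>
#include <vector>

// Illustrative sketch only: buckets live in a vector here, not in a file.
// A bucket stores the high bits of the hash (partial hash) plus the key;
// the low bits of the hash select the starting bucket index.
class DiskHashSetSketch {
public:
    explicit DiskHashSetSketch(size_t num_buckets) : buckets_(num_buckets) {}

    bool Contains(uint64_t key) const {
        uint32_t const partial = PartialHash(key);
        for (size_t i = key % buckets_.size();; i = (i + 1) % buckets_.size()) {
            Bucket const & b = buckets_[i];
            if (b.partial_hash == 0)
                return false;                      // empty bucket ends the probe chain
            if (b.partial_hash == partial && b.key == key)
                return true;                       // partial hash matched, then the key matched
        }
    }

    // Returns true if the key was inserted, false if it was already present.
    // Assumes the table never becomes completely full.
    bool Insert(uint64_t key) {
        uint32_t const partial = PartialHash(key);
        for (size_t i = key % buckets_.size();; i = (i + 1) % buckets_.size()) {
            Bucket & b = buckets_[i];
            if (b.partial_hash == partial && b.key == key)
                return false;                      // already present
            if (b.partial_hash == 0) {             // first empty bucket: store here
                b = Bucket{partial, key};
                return true;
            }
        }
    }

private:
    struct Bucket {
        uint32_t partial_hash = 0;   // high bits of the key; 0 marks an empty bucket
        uint64_t key = 0;            // here the key is simply the full 64-bit line hash
    };
    // OR with 1 so that a valid partial hash is never 0 (the empty marker).
    static uint32_t PartialHash(uint64_t key) { return uint32_t(key >> 32) | 1u; }

    std::vector<Bucket> buckets_;
};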
Reads and writes in the hash set are done as random reads and writes on disk. It is best to use an SSD instead of an HDD, because on an SSD random reads and writes are very fast.
Of course, even on an SSD every write stores 8 KB at a time, even if you change only 16 bytes, because the SSD flash cluster size is 8 KB. Reading, though, is fast. Hence this disk hash set is not too fast.
This algorithm is the 2nd fastest of my three algorithms.
The second, sorting-based algorithm does the following.
First it accumulates in memory as many hashes of the text file's lines as free memory allows. Then it sorts them in memory with std::sort using std::execution::par_unseq, which sorts the in-memory array in a multi-threaded fashion.
The sorted in-memory portion of hashes is then stored to disk in a first file. Another portion of in-memory hashes is sorted and stored in a second file, and so on, until we have stored all hashes in many files.
Together with the hash, each entry of a sorted file keeps the index of the line in the source file; it is these pairs that we sort.
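Sorting one in-memory portion of (hash, line index) pairs could be sketched roughly like this (an illustration only; the real code reads the hashes from the input file and chooses the portion size based on free memory, and the function name here is made up):

#include <algorithm>
#include <cstdint>
#include <execution>
#include <utility>
#include <vector>

// One portion: pairs of (line hash, line index), sorted multi-threaded.
// Sorting by the pair sorts primarily by hash, secondarily by line index.
void SortPortion(std::vector<std::pair<uint64_t, uint64_t>> & portion) {
    std::sort(std::execution::par_unseq, portion.begin(), portion.end());
    // ... the sorted portion would then be written to its own temporary file.
}

Note that with GCC the parallel execution policies typically require linking against Intel TBB.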
Then we merge all the files using an N-way merge sort; to do this I use a so-called heap, which is emulated in C++ through std::make_heap(), std::push_heap() and std::pop_heap().
The merged sequence of pairs is stored in one huge file.
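As an illustration of that N-way merge step, here is a minimal sketch that merges already-sorted runs using the standard heap helpers. In the real implementation the runs are sorted files of (hash, line index) pairs streamed from disk; here they are just in-memory vectors, and the function and struct names are invented for the example:

#include <algorithm>
#include <cstdint>
#include <vector>

// Merge N sorted runs into one sorted sequence using a min-heap of
// "current element of each run", emulated via std::make_heap/push_heap/pop_heap.
std::vector<uint64_t> NWayMerge(std::vector<std::vector<uint64_t>> const & runs) {
    struct Item { uint64_t value; size_t run, pos; };
    // std heaps are max-heaps, so invert the comparison to get a min-heap.
    auto const greater = [](Item const & a, Item const & b) { return a.value > b.value; };

    std::vector<Item> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty())
            heap.push_back({runs[r][0], r, 0});
    std::make_heap(heap.begin(), heap.end(), greater);

    std::vector<uint64_t> merged;
    while (!heap.empty()) {
        std::pop_heap(heap.begin(), heap.end(), greater);   // smallest element moves to the back
        Item const top = heap.back();
        heap.pop_back();
        merged.push_back(top.value);
        if (top.pos + 1 < runs[top.run].size()) {           // advance within the same run
            heap.push_back({runs[top.run][top.pos + 1], top.run, top.pos + 1});
            std::push_heap(heap.begin(), heap.end(), greater);
        }
    }
    return merged;
}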
After the sorting of pairs is done, we uniquize them by scanning sequentially and removing adjacent pairs that have the same hash but different line indices, keeping only the first line index. This uniquized data is stored in another huge file, and we store only the second elements of the pairs, i.e. the line indices.
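That deduplication pass could be sketched like this (assuming for the example that the sorted pairs fit in a vector, whereas the real code streams them from the merged file):

#include <cstdint>
#include <utility>
#include <vector>

// Keep only the first line index for each distinct hash. The input is sorted
// by (hash, line index), so the first pair of each hash group has the smallest index.
std::vector<uint64_t> UniqueLineIndices(
        std::vector<std::pair<uint64_t, uint64_t>> const & sorted_pairs) {
    std::vector<uint64_t> indices;
    for (size_t i = 0; i < sorted_pairs.size(); ++i)
        if (i == 0 || sorted_pairs[i].first != sorted_pairs[i - 1].first)
            indices.push_back(sorted_pairs[i].second);  // store only the line index
    return indices;
}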
The uniquized file is then sorted again. As a reminder, it contains only line indices. Sorting is done in the way described at the start of this algorithm: we split into many in-memory portions, sort them, and N-way merge them into a single huge file.
Finally, we sequentially scan the last uniquized and sorted huge file together with the original text file. Scanning them as a pair, we do a 2-way merge, meaning that we skip the lines whose indices are absent.
Done. Now our text file contains only unique lines.
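The final filtering pass could look roughly like this (a sketch with made-up names, assuming the kept line indices arrive already sorted; in the real code they are streamed from the sorted index file rather than held in a vector):

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// 2-way merge of the original file with the sorted list of kept line indices:
// a line is copied to the output only if its index appears in the list.
void FilterByKeptIndices(std::string const & in_path, std::string const & out_path,
                         std::vector<uint64_t> const & kept_sorted) {
    std::ifstream in(in_path);
    std::ofstream out(out_path);
    std::string line;
    size_t pos = 0;  // current position inside kept_sorted
    for (uint64_t index = 0; std::getline(in, line); ++index) {
        if (pos < kept_sorted.size() && kept_sorted[pos] == index) {
            out << line << '\n';
            ++pos;
        }
    }
}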
The third algorithm is based on a B-Tree, which can store any sorted sequence of elements. In our case we store the sorted hashes of the lines.
A B-Tree is quite complex to explain; it is better to read the Wikipedia article.
In short, a B-Tree is constructed the following way. The tree has a branching degree equal to some B, let's say B = 10; this means every intermediate node, except the leaves, has at most 10 children.
Each intermediate node holds pointers to up to 10 children plus the smallest key of each child's subtree.
When we insert something into the tree, we descend from the root down to a leaf, and on the way we check whether the searched key is contained in a child subtree. It is contained in a child subtree only if that (left) child has a smaller or equal smallest key, while the next (right) child has a bigger one.
Then we add the new entry to the leaf. If the leaf overflows in size, i.e. contains more than 10 elements, it is split into two nodes with an equal number of elements. Then, inside its parent, instead of a single pointer to the child we try to add two pointers to the children. The parent's child count may now exceed 10 elements, in which case we split it into two equal nodes too.
In the same way all nodes on the way from the leaf to the root may be split if necessary, until we meet a node that has fewer than 10 pointers; then we don't need to split it and the process finishes.
Also, up to the root we need to update the smallest subtree key, because it may have changed if the value inserted into the leaf landed in the first position.
If we need to search the tree, we do the same as described above, meaning we descend from the root to a leaf for the given key. The entry inside the leaf either contains our key, in which case we found it, or a non-equal key, in which case the key is not in the tree.
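To make the descent rule concrete, here is a tiny in-memory sketch of the node layout and the search path (the real tree keeps its nodes on disk and uses a configurable branching degree; the struct and function names below are invented for the illustration):

#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Tiny in-memory sketch: an internal node stores the smallest key of each
// child subtree; a leaf stores the sorted hashes themselves.
struct Node {
    bool is_leaf = false;
    std::vector<uint64_t> keys;                    // leaf: stored hashes;
                                                   // internal: smallest key of each child subtree
    std::vector<std::unique_ptr<Node>> children;   // empty for leaves
};

// Descend from the root to the leaf that may contain `key`: pick the
// rightmost child whose smallest subtree key is <= key.
Node const * FindLeaf(Node const * node, uint64_t key) {
    while (!node->is_leaf) {
        // upper_bound gives the first child whose smallest key is > key;
        // step one back to get the child subtree that may contain the key.
        auto it = std::upper_bound(node->keys.begin(), node->keys.end(), key);
        size_t const child = (it == node->keys.begin()) ? 0 : size_t(it - node->keys.begin()) - 1;
        node = node->children[child].get();
    }
    return node;
}

// The key is present iff the leaf holds an equal value.
bool Contains(Node const * root, uint64_t key) {
    Node const * leaf = FindLeaf(root, key);
    return std::binary_search(leaf->keys.begin(), leaf->keys.end(), key);
}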
In my B-Tree algorithm I didn't implement deletion, only insertion and search. Deletion is more complex, but it is not needed for our task; it can be implemented later if needed.
This 3rd algorithm is the slowest, because it does around 4-5 random SSD reads and writes for every added or looked-up element.
Now, below I present the whole C++ code that implements all 3 algorithms. This code should be compiled with GCC; Clang can compile it as well. Right now it is not suitable for MSVC, but I can probably tweak it to support MSVC too; the only obstacle is that MSVC doesn't have the __int128 type that GCC/Clang have.
This program is fully self-contained: just compile it and run it. It runs speed tests of single operations, plus full tests on pre-generated data. You may change nums_cnt = 1ULL << 17; to some bigger value if you need to generate more lines of text; nums_cnt signifies how many lines there are.
Try it online! (GodBolt)
SOURCE CODE HERE. The post together with the full code is so large that it can't fit within the 30 000 character limit of a StackOverflow post; the code alone is 46 KB in size. Hence I provide the code separately via the GitHub Gist link below. You may also click on Try it online! above; this will take you to the code on the GodBolt server, where you can try it live.
Console Output:
Insert Time HS 174.1 ns/num, BT 5304.9 ns/num, UM 278.5 ns/num, Boost HS 1.599x, Boost BT 0.052x
Erase Time HS 217.4 ns/num, UM 286.9 ns/num, Boost HS 1.320x
Find Time HS 113.7 ns/num, BT 1954.6 ns/num, UM 61.8 ns/num, Boost HS 0.544x, Boost BT 0.032x
Algo HashSet 61.34 sec
Algo Sort 13.8 sec
Algo BTree 256.9 sec