Tags: c++, arrays, performance, merge, array-merge

Filter Lists for Repeats - List Deduplication


I have some lists of words spread across several files, and I need a quick way to merge them all into one file. I'd like to remove duplicates during the merge so the final list does not contain multiple instances of any single word.

Example:

The text file list_a.txt contains the following:

apple
pear
peach

The text file list_b.txt contains the following:

foo
bar
peach
car

When merged, the output file should contain:

apple
pear
peach
foo
bar
car

Note that both list_a.txt and list_b.txt contain the word peach, but it appears only once in the output file.

Here's the code I was using:

#include <fstream>
#include <iostream>
#include <string>
#include <windows.h> // for Sleep()

using namespace std;

// trim() helper (assumed; the original snippet used it without showing
// its definition): strips leading/trailing whitespace from a line.
string trim(const string &s)
{
    size_t first = s.find_first_not_of(" \t\r\n");
    if (first == string::npos) return "";
    size_t last = s.find_last_not_of(" \t\r\n");
    return s.substr(first, last - first + 1);
}

int main()
{
    string myList = "";
    string myFiles[] = {"list_a.txt", "list_b.txt"};
    string line;
    int iterationsSinceSleep = 0;
    size_t length = sizeof(myFiles)/sizeof(myFiles[0]);
    for(unsigned int i = 0; i < length; i++){
        cout << "Starting " << myFiles[i] << endl;
        ifstream myfile((string("C:/words/").append(myFiles[i])).c_str());
        if (myfile.is_open())
        {
            while ( getline (myfile, line) ){
                string trimmedLine = trim(line);
                // Substring check: scans the whole accumulated list for
                // every new line, and can also skip a short word that
                // happens to appear inside a longer stored word.
                if(myList.find(trimmedLine) == string::npos){
                    myList.append(trimmedLine + '\n');
                }

                iterationsSinceSleep++;
                iterationsSinceSleep %= 10000;
                // Save the CPU!
                if(iterationsSinceSleep == 0) Sleep(10);
            }
            myfile.close();
        }else{
            cout << "Could not open & process " << myFiles[i] << endl;
        }
        Sleep(75); // Save the CPU!
        iterationsSinceSleep = 0;
    }

    // Write the merged, deduplicated list to the output file.
    ofstream myfile ("C:/words/merged/final.txt");
    if (myfile.is_open())
    {
        cout << "Writing filtered list" << endl;
        myfile << myList;
        myfile.flush();
        myfile.close();
    }else{
        cout << "Could not save filtered list." << endl;
    }

    return 0;
}

This works fine for smaller lists/files, but one of my lists has a few million lines.

I need a way to make this code run well even if it has to process several files with millions of lines.

My first idea for improving this is to use either an array or a vector instead of a string for storing the unique lines. However, both of these methods have advantages and disadvantages (I've put a rough sketch of the vector idea after these lists).

Pros of using an array:

  • Faster comparison checking (I think)
  • Faster element accessing

Cons of using an array:

  • Reallocating the array to insert new strings might be slow
  • The program must keep track of the array's length (not a big concern, but a factor)

Pros of using a vector:

  • Dynamically add elements
  • Built-in searching function

Cons of using a vector:

  • Inserting elements into a vector is slow (so I've read)
  • I would imagine the vector has more overhead.
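To make the vector idea concrete, here's the kind of pattern I'm picturing. It's just a toy sketch using the example words from above, not the real file-reading code:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Toy input: the words from list_a.txt and list_b.txt above.
    const std::string input[] = {"apple", "pear", "peach",
                                 "foo",   "bar",  "peach", "car"};

    std::vector<std::string> words;
    for (const std::string &w : input) {
        // The "built-in searching function": std::find is a linear
        // scan, so the whole merge is O(n^2) over n total lines.
        if (std::find(words.begin(), words.end(), w) == words.end())
            words.push_back(w); // amortized O(1); no manual realloc bookkeeping
    }

    for (const std::string &w : words)
        std::cout << w << '\n'; // apple pear peach foo bar car
    return 0;
}

This keeps the first-seen order from my example, but the std::find scan for every line is exactly the part that worries me with millions of lines.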

Can anybody offer suggestions to make this code more efficient? Speed is a major concern, but I also need to keep memory consumption in mind.

Thank you in advance.


Solution

  • Use a std::set. A set does not allow duplicate entries, and each insert is an O(log n) lookup rather than a scan of everything stored so far. Try something like:

    std::set<std::string> mySet;
    ...
    mySet.insert(trimmedLine);
    ...
    for (auto &&str : mySet)
        myFile << str << '\n';


    Note: I just typed this here, so it may have some typos.

    Also note: this will sort the output; I'm not sure whether you want that or not.
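
    If you want to keep the first-seen order from your example instead, one option is a std::unordered_set used purely for membership tests, writing each new word as soon as it is first seen. A rough sketch along the same lines (again just typed here, untested; your trim() call is omitted for brevity):

    #include <fstream>
    #include <string>
    #include <unordered_set>

    int main()
    {
        // Same paths as in your question.
        const char *inputs[] = {"C:/words/list_a.txt", "C:/words/list_b.txt"};

        std::unordered_set<std::string> seen; // O(1) average membership test
        std::ofstream out("C:/words/merged/final.txt");

        for (const char *path : inputs) {
            std::ifstream in(path);
            std::string line;
            while (std::getline(in, line)) {
                // insert() returns a {iterator, inserted} pair; the bool
                // is true only the first time a given word is seen.
                if (seen.insert(line).second)
                    out << line << '\n';
            }
        }
        return 0;
    }

    Average-case insert and lookup are O(1), and each word is written as soon as it is accepted, so this also avoids building one giant string in memory.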