I have some lists of words spread across several files and I need a quick way to merge them all into one file. I'd like to remove duplicates when merging so the final list does not contain multiple instances of a single word.
Example:
The text file list_a.txt says the following:
apple
pear
peach
The text file list_b.txt says the following:
foo
bar
peach
car
When merged, the output file should say:
apple
pear
peach
foo
bar
car
Note that both list_a.txt and list_b.txt contained the word peach, but it only appeared once in the output file.
Here's the code I was using:
#include <iostream>
#include <fstream>
#include <string>
#include <windows.h> // for Sleep()

using namespace std;

// My trim() helper, reproduced here so the snippet compiles:
// it strips leading/trailing whitespace from a line.
string trim(const string &s)
{
    const char *ws = " \t\r\n";
    size_t first = s.find_first_not_of(ws);
    if (first == string::npos) return "";
    return s.substr(first, s.find_last_not_of(ws) - first + 1);
}

int main()
{
    string myList = "";
    string myFiles[] = {"list_a.txt", "list_b.txt"};
    string line;
    int iterationsSinceSleep = 0;
    size_t length = sizeof(myFiles) / sizeof(myFiles[0]);

    for (unsigned int i = 0; i < length; i++) {
        cout << "Starting " << myFiles[i] << endl;
        ifstream myfile((string("C:/words/").append(myFiles[i])).c_str());
        if (myfile.is_open())
        {
            while (getline(myfile, line)) {
                string trimmedLine = trim(line);
                // NOTE: find() matches substrings, so e.g. "pea" would be
                // skipped as a duplicate of an already-stored "peach".
                if (myList.find(trimmedLine) == string::npos) {
                    myList.append(trimmedLine + '\n');
                }
                iterationsSinceSleep++;
                iterationsSinceSleep %= 10000;
                // Save the CPU!
                if (iterationsSinceSleep == 0) Sleep(10);
            }
            myfile.close();
        } else {
            cout << "Could not open & process " << myFiles[i] << endl;
        }
        Sleep(75); // Save the CPU!
        iterationsSinceSleep = 0;
    }

    // write the merged list to the output file
    ofstream myfile("C:/words/merged/final.txt");
    if (myfile.is_open())
    {
        cout << "Writing filtered list" << endl;
        myfile << myList;
        myfile.flush();
        myfile.close();
    } else {
        cout << "Could not save filtered list." << endl;
    }
    return 0;
}
This works fine for smaller lists/files, but one of my lists has a few million lines. I need this code to run well even when it has to process several files with millions of lines each.
My first idea for improving this was to use either an array or a vector instead of a string for storing the unique lines. However, both of these methods have advantages and disadvantages.
Pros of using an array: contiguous storage with fast indexed access and no per-element growth overhead.
Cons of using an array: the size is fixed, so I would have to know (or over-estimate) the number of unique lines up front.
Pros of using a vector: it grows dynamically, and duplicates can be removed with a sort-and-unique pass (see the sketch after this list).
Cons of using a vector: it reallocates as it grows, and checking for duplicates still needs either a linear scan or a sort.
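For reference, here's a rough sketch of the vector approach I was considering (untested; it assumes a sorted final list is acceptable, and reuses the file paths from above):

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    const std::string files[] = {"C:/words/list_a.txt", "C:/words/list_b.txt"};

    std::vector<std::string> lines;
    for (const auto &path : files) {
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line))
            lines.push_back(line); // trim here if needed
    }

    // std::unique only removes adjacent duplicates,
    // so the vector has to be sorted first.
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());

    std::ofstream out("C:/words/merged/final.txt");
    for (const auto &l : lines)
        out << l << '\n';
    return 0;
}

The downside is that everything gets loaded before deduplication, so peak memory usage includes all the duplicates.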
Can anybody offer suggestions for improving this code and making it more efficient? Speed is a major concern, but I also need to keep memory consumption in mind.
Thank you in advance.
Use a std::set. A set does not allow duplicate entries. Try something like:
std::set<std::string> mySet;
...
mySet.insert(trimmedString);
...
for (auto &&str : mySet)
    myFile << str << '\n'; // write each word on its own line
Note: I just typed this here, so it may have some typos.
Also note: a std::set keeps its elements in sorted order, so this will sort the output; I'm not sure if you want that or not.
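For completeness, here's a fuller (still minimal and untested) version of the idea; the file names are carried over from your question, and the trim() below is just a stand-in for your own helper:

#include <fstream>
#include <iostream>
#include <set>
#include <string>

// Stand-in for the asker's trim() helper: strips surrounding whitespace.
std::string trim(const std::string &s)
{
    const char *ws = " \t\r\n";
    size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    return s.substr(first, s.find_last_not_of(ws) - first + 1);
}

int main()
{
    const std::string files[] = {"C:/words/list_a.txt", "C:/words/list_b.txt"};
    std::set<std::string> mySet; // lookup is O(log n) instead of a full-text scan

    for (const auto &path : files) {
        std::ifstream in(path);
        if (!in.is_open()) {
            std::cout << "Could not open & process " << path << '\n';
            continue;
        }
        std::string line;
        while (std::getline(in, line)) {
            std::string t = trim(line);
            if (!t.empty())
                mySet.insert(t); // duplicate insertions are simply ignored
        }
    }

    std::ofstream out("C:/words/merged/final.txt");
    for (auto &&str : mySet)
        out << str << '\n';
    return 0;
}

If you need to keep first-seen order instead of sorted order (as in your example output), the usual pattern is a std::unordered_set for the membership test plus a std::vector to record insertion order.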