I have a large vector (mainvect) of struct info objects (about 8 million elements), and I want to remove the duplicates. The struct consists of a pid and a uid.
struct info
{
    int pid;
    string uid;
};
I have another vector (vect1) that holds each pid and its number of occurrences in mainvect (it helps to search only the relevant indices instead of the whole of mainvect). vect1 has about 420k elements.
struct pidInfo
{
    int pid;
    int numofoccurence;
};
I want to store the unique elements of mainvect in vect2.
.
.
// sort mainvect based on pid
sort(mainvect.begin(), mainvect.end(), sortByPId());
int start = 0;
int end = 0;
vector<string> temp; // to store uids with a specific pid
for (size_t i = 0; i < vect1.size(); i++)
{
    // [start, end) is the range of mainvect that holds the current pid
    end = end + vect1[i].numofoccurence;
    for (int j = start; j < end; j++)
    {
        temp.push_back(mainvect[j].uid);
    }
    start = start + vect1[i].numofoccurence;
    // remove duplicate uids
    sort(temp.begin(), temp.end());
    temp.erase(unique(temp.begin(), temp.end()), temp.end());
    // push remaining unique uids
    for (size_t k = 0; k < temp.size(); k++)
    {
        info obb;
        obb.pid = vect1[i].pid;
        obb.uid = temp[k];
        vect2.push_back(obb);
    }
    // empty the temp vector for the next i iteration
    temp.clear();
}
.
.
But when I run the code, it throws an exception, as shown in the following figure.
I think you actually have an algorithm problem. On each iteration you sort temp and keep only its unique elements. But with this approach each iteration will add more and more duplicates into vect2. So you should sort and keep only unique elements in vect2 as well. Actually, it would probably be better to use std::set instead of temp and vect2.
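As a rough illustration of the std::set idea, here is a minimal sketch. It assumes the info struct from the question; the names InfoLess, uniqueViaSet and uniqueViaSort are made up for this example and are not from the original code. The second function shows the sort-and-unique alternative applied once to the whole vector, which also makes vect1 unnecessary.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <tuple>
#include <vector>

struct info
{
    int pid;
    std::string uid;
};

// Hypothetical comparator: order by pid first, then by uid, so two entries
// with the same (pid, uid) pair are treated as equivalent.
struct InfoLess
{
    bool operator()(const info& a, const info& b) const
    {
        return std::tie(a.pid, a.uid) < std::tie(b.pid, b.uid);
    }
};

// Variant 1: std::set silently drops duplicates on insertion.
std::vector<info> uniqueViaSet(const std::vector<info>& mainvect)
{
    std::set<info, InfoLess> seen(mainvect.begin(), mainvect.end());
    return std::vector<info>(seen.begin(), seen.end());
}

// Variant 2: sort a copy once, then erase adjacent duplicates.
std::vector<info> uniqueViaSort(std::vector<info> v)
{
    std::sort(v.begin(), v.end(), InfoLess());
    auto sameEntry = [](const info& a, const info& b) {
        return a.pid == b.pid && a.uid == b.uid;
    };
    v.erase(std::unique(v.begin(), v.end(), sameEntry), v.end());
    return v;
}
```

Both variants return the unique entries ordered by pid and then uid. With millions of elements the sort-based variant usually needs less memory, because a std::set allocates one node per element, but that is worth measuring on your own data.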
Another suggestion would be to use a better storage type for uid if it has some sort of fixed-length format, such as a GUID.
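To make that concrete, here is a sketch under a big assumption: that uid is a textual GUID made of 32 hex digits, optionally separated by dashes. The question does not say what the uid format is, so Guid, hexValue, parseGuid and compactInfo are all hypothetical names for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// 16 raw bytes instead of a heap-allocated string per element.
using Guid = std::array<std::uint8_t, 16>;

// Returns the value of one hex digit, or -1 if c is not a hex digit.
inline int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Parses a GUID string, skipping '-' separators.
// Returns false (and leaves out untouched) unless exactly 32 hex digits are found.
inline bool parseGuid(const std::string& text, Guid& out)
{
    Guid result{};
    std::size_t nibbles = 0;
    for (char c : text)
    {
        if (c == '-') continue;
        const int v = hexValue(c);
        if (v < 0 || nibbles >= 32) return false;
        const std::size_t i = nibbles / 2;
        if (nibbles % 2 == 0)
            result[i] = static_cast<std::uint8_t>(v << 4);          // high nibble
        else
            result[i] = static_cast<std::uint8_t>(result[i] | v);   // low nibble
        ++nibbles;
    }
    if (nibbles != 32) return false;
    out = result;
    return true;
}

// Hypothetical compact replacement for the info struct.
struct compactInfo
{
    int pid;
    Guid uid;
};
```

std::array already provides operator< and operator==, so a vector of compactInfo can be sorted and deduplicated the same way as before, while comparisons work on 16 bytes rather than on strings.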