Multithreading is slower than no threading C++

I am new to multithreaded programming. I am aware that several similar questions have been asked on SO before, but I would like an answer specific to my code.

I have two vectors of objects (v1 & v2) that I want to loop through and, depending on whether they meet some criteria, add these objects to a single vector like so:

Non-Multithread Case

std::vector<hobj> validobjs;
int length = 70;

for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
    if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}

for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
    if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}

Multithread Case

std::vector<hobj> validobjs;
int length = 70;

#pragma omp parallel
{
    std::vector<hobj> threaded1;   // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
        if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }

    std::vector<hobj> threaded2;  // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
        if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }

    #pragma omp critical  // Insert local vectors to main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}

In the non-multithreaded case my total time spent doing the operation is around 4x faster than the multithreaded case (~1.5s vs ~6s).

I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.

So questions:

1) Is this kind of operation suited for multi-threading?

2) If yes to 1) - does the multithreaded code look reasonable?

3) Is there anything I can do to improve the performance to get it faster than the no-thread case?

Additional info:

The above code is nested within a much larger codebase that is performing 10,000 - 100,000s of iterations (this loop is not using multithreading). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive between iterations until the above code is executed again.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4

Cheers!

Solution

I'm no expert on multithreading, but I'll give it a try:

Is this kind of operation suited for multi-threading?

I would say yes. Especially if you have huge datasets, you could split them even further and run any number of filtering operations in parallel. But it depends on the amount of data you want to process: thread creation and synchronization are not free.

Neither is the merging at the end of the threaded version.

Does the multithreaded code look reasonable?

I think you're on the right path by letting each thread work on independent data.

Is there anything I can do to improve the performance to get it faster than the no-thread case?

I see a few points that might improve performance:

The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
The same goes for merging the two vectors at the end, which happens inside the critical section. First reserve:
```
validobjs.reserve(v1.size() + v2.size());
```
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects and on whether a custom copy constructor runs additional code. Consider storing only indices of the valid elements or pointers to valid elements.
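A minimal sketch of the index idea, assuming a hypothetical `Source` type standing in for whatever the elements of v1 point to. Only the two flag getters come from the question; the struct itself and the function name `valid_indices` are made up for illustration. It is written serially for brevity:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the type that v1 and v2 point to.
struct Source {
    bool ignore = false;
    bool error = false;
    bool get_IgnoreFlag() const { return ignore; }
    bool get_ErrorFlag() const { return error; }
};

// Record the positions of the valid elements instead of copying objects.
std::vector<std::size_t> valid_indices(const std::vector<const Source*>& v) {
    std::vector<std::size_t> idx;
    idx.reserve(v.size());  // upper bound, so push_back never reallocates
    for (std::size_t i = 0; i < v.size(); ++i) {
        assert(v[i] != nullptr);
        if (!v[i]->get_IgnoreFlag() && !v[i]->get_ErrorFlag()) {
            idx.push_back(i);
        }
    }
    return idx;
}
```

Later code would then reach a valid element as `*v[idx[k]]`, so the filtering step only moves integers around and never copies an object.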
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
1. Filter the data in two threads as you do now.
2. Synchronise them and allocate a vector with a number of elements:
```
validobjs.resize(v1.size() + v2.size());
```
3. Let each thread insert elements into independent parts of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
  
  Although I'm not entirely sure this is legal. As far as I know, writing to distinct elements of a std::vector from different threads is not a data race, as long as no thread resizes the vector at the same time.
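The three steps above could look roughly like this with OpenMP. This is a sketch under assumptions, not a tuned implementation: `Source`, the layout of `hobj`, and the function name `filter_disjoint` are hypothetical stand-ins, and `hobj` is assumed to be cheap to default-construct. Unlike step 2, it sizes the result to the exact number of valid elements by turning per-thread counts into start offsets:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Fallbacks so the sketch also builds (single-threaded) without -fopenmp.
static int omp_get_thread_num() { return 0; }
static int omp_get_num_threads() { return 1; }
#endif

// Hypothetical stand-ins for the question's types.
struct Source {
    bool ignore = false;
    bool error = false;
    bool get_IgnoreFlag() const { return ignore; }
    bool get_ErrorFlag() const { return error; }
};

struct hobj {
    const Source* src = nullptr;
    int length = 0;
    hobj() = default;  // resize() below needs a default constructor
    hobj(const Source* s, int len) : src(s), length(len) {}
};

std::vector<hobj> filter_disjoint(const std::vector<const Source*>& v1,
                                  const std::vector<const Source*>& v2,
                                  int length) {
    std::vector<hobj> validobjs;
    std::vector<std::size_t> offsets;  // offsets[t]: first index owned by thread t

    #pragma omp parallel
    {
        const std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
        const std::size_t nthreads = static_cast<std::size_t>(omp_get_num_threads());

        #pragma omp single
        offsets.assign(nthreads + 1, 0);
        // (implicit barrier at the end of the single construct)

        // Step 1: filter into a thread-local vector.
        std::vector<hobj> local;
        #pragma omp for nowait
        for (std::size_t i = 0; i < v1.size(); ++i) {
            if (!v1[i]->get_IgnoreFlag() && !v1[i]->get_ErrorFlag())
                local.emplace_back(v1[i], length);
        }
        #pragma omp for nowait
        for (std::size_t j = 0; j < v2.size(); ++j) {
            if (!v2[j]->get_IgnoreFlag() && !v2[j]->get_ErrorFlag())
                local.emplace_back(v2[j], length);
        }
        offsets[tid + 1] = local.size();  // each thread writes only its own slot

        // Step 2: synchronise, turn counts into start offsets, allocate once.
        #pragma omp barrier
        #pragma omp single
        {
            for (std::size_t t = 0; t < nthreads; ++t) offsets[t + 1] += offsets[t];
            validobjs.resize(offsets[nthreads]);
        }
        // (implicit barrier at the end of the single construct)

        // Step 3: every thread copies into its own, non-overlapping range.
        assert(offsets[tid] + local.size() <= validobjs.size());
        std::copy(local.begin(), local.end(),
                  validobjs.begin() + static_cast<std::ptrdiff_t>(offsets[tid]));
    }
    return validobjs;
}
```

The order of the elements differs from the serial version, because each thread's matches from both input vectors end up next to each other. Whether this beats the serial loop depends on how expensive constructing and copying an hobj is, so it is worth measuring before committing to it.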

You could also think about using std::list (a linked list). Concatenating linked lists or removing elements happens in constant time; however, adding elements is a bit slower than on a std::vector with reserved memory.
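To illustrate the constant-time concatenation, here is a small sketch in which each thread fills its own std::list and then splices it into the shared one. It is simplified on purpose: the elements are plain ints and the "keep even numbers" test is a hypothetical placeholder for the flag checks in the question, as is the function name `filter_even_list`:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

std::list<int> filter_even_list(const std::vector<int>& in) {
    std::list<int> result;

    #pragma omp parallel
    {
        std::list<int> local;  // each thread has its own list

        #pragma omp for nowait
        for (std::size_t i = 0; i < in.size(); ++i) {
            if (in[i] % 2 == 0) local.push_back(in[i]);  // placeholder criterion
        }

        #pragma omp critical
        {
            // splice relinks the nodes of local into result: no element is
            // copied, and local is left empty afterwards.
            result.splice(result.end(), local);
            assert(local.empty());
        }
    }
    return result;
}
```

The critical section stays short because splice only relinks nodes, but every push_back allocates a node, which is the cost mentioned above. With more than one thread the order of the elements in the result is not deterministic.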

Those were my thoughts on this; I hope there was something useful in them.