Recently I was looking for a way to replace tokens in a string, which is essentially find-and-replace (though there is at least one additional approach to the problem) and looks like quite a banal task. I came up with a couple of possible implementations, but none of them was satisfying from a performance point of view. The best result was ~50us per iteration. The test case was ideal: the string never grew in size, and initially I omitted the requirement to be case-insensitive.
Here is the code at Coliru
Results from my machine:
Boost.Spirit symbols result: 3421?=3421
100000 cycles took 6060ms.
Boyer-Moore results:3421?=3421
100000 cycles took 5959ms.
Boyer Moore Hospool result:3421?=3421
100000 cycles took 5008ms.
Knuth Morris Pratt result: 3421?=3421
100000 cycles took 12451ms.
Naive STL search and replace result: 3421?=3421
100000 cycles took 5532ms.
Boost replace_all result:3421?=3421
100000 cycles took 4860ms.
So the question: what takes so long in such a simple task? One could say, OK, simple task, go ahead and implement it better. But the reality is that a 15-year-old naive MFC implementation does the task an order of magnitude faster:
CString FillTokenParams(const CString& input, const std::unordered_map<std::string, std::string>& tokens)
{
    CString tmpInput = input;
    for(const auto& token : tokens)
    {
        int pos = 0;
        while(pos != -1)
        {
            pos = tmpInput.Find(token.first.c_str(), pos);
            if(pos != -1)
            {
                int tokenLength = token.first.size();
                tmpInput.Delete(pos, tokenLength);
                tmpInput.Insert(pos, token.second.c_str());
                pos += 1;
            }
        }
    }
    return tmpInput;
}
result:
MFC naive search and replace result:3421?=3421
100000 cycles took 516ms.
How does this clumsy code outperform modern C++? Why were the other implementations so slow? Am I missing something fundamental?
EDIT001: I've invested in this question; the code has been profiled and triple-checked. You can be unsatisfied with this or that, but std::string::replace is not what takes the time. In any STL implementation, the search is what takes most of the time; Boost.Spirit wastes time on allocating the tst (ternary search tree) nodes. I don't expect anyone to point at a line in a function, say "this is your problem", and poof, the problem is gone. The question is how MFC manages to do the same work 10 times faster.
EDIT002: I just drilled down into the MFC implementation of Find and wrote a function which mimics it:
namespace mfc
{
    std::string::size_type Find(const std::string& input, const std::string& subString, std::string::size_type start)
    {
        if(subString.empty())
        {
            return std::string::npos;
        }
        if(start > input.size()) // size_type is unsigned, so a start < 0 check could never fire
        {
            return std::string::npos;
        }
        auto found = strstr(input.c_str() + start, subString.c_str());
        return ((found == nullptr) ? std::string::npos : std::string::size_type(found - input.c_str()));
    }
}
std::string MFCMimicking(const std::string& input, const std::unordered_map<std::string, std::string>& tokens)
{
    auto tmpInput = input;
    for(const auto& token : tokens)
    {
        std::string::size_type pos = 0; // note: auto pos = 0 would deduce int and break the npos comparisons
        while(pos != std::string::npos)
        {
            pos = mfc::Find(tmpInput, token.first, pos);
            if(pos != std::string::npos)
            {
                auto tokenLength = token.first.size();
                tmpInput.replace(pos, tokenLength, token.second.c_str());
                pos += 1;
            }
        }
    }
    return tmpInput;
}
Results:
MFC mimicking expand result:3421?=3421
100000 cycles took 411ms.
Meaning ~4us per call. Go beat that, C strstr.
EDIT003: Compiling and running with -Ox:
MFC mimicking expand result:3421?=3421
100000 cycles took 660ms.
MFC naive search and replace result:3421?=3421
100000 cycles took 856ms.
Manual expand result:3421?=3421
100000 cycles took 1995ms.
Boyer-Moore results:3421?=3421
100000 cycles took 6911ms.
Boyer Moore Hospool result:3421?=3421
100000 cycles took 5670ms.
Knuth Morris Pratt result: 3421?=3421
100000 cycles took 13825ms.
Naive STL search and replace result: 3421?=3421
100000 cycles took 9531ms.
Boost replace_all result:3421?=3421
100000 cycles took 8996ms.
Running with -O2 (as in the original measurements), but with 10k cycles:
MFC mimicking expand result:3421?=3421
10000 cycles took 104ms.
MFC naive search and replace result:3421?=3421
10000 cycles took 105ms.
Manual expand result:3421?=3421
10000 cycles took 356ms.
Boyer-Moore results:3421?=3421
10000 cycles took 1355ms.
Boyer Moore Hospool result:3421?=3421
10000 cycles took 1101ms.
Knuth Morris Pratt result: 3421?=3421
10000 cycles took 1973ms.
Naive STL search and replace result: 3421?=3421
10000 cycles took 923ms.
Boost replace_all result:3421?=3421
10000 cycles took 880ms.
OK, it will be a long story. Just to remind you of the questions asked: how does the clumsy MFC code outperform modern C++, and why were the other implementations so slow?
Surprisingly, both questions have the same answer: C++ overhead. Yep. Our shiny modern C++ has an overhead which we mostly dismiss in exchange for flexibility and elegance.
However, when it comes to sub-microsecond resolutions (not that C++ is incapable of doing things at nanosecond resolution), the overhead becomes more prominent.
Let me showcase it with the same code I posted in the question, restructured so it is more coherent about what is done in each function.
It uses the aforementioned Nonius (thanks to @sehe), and the interactive results are here.
There are two outstanding results: these functions are at least one order of magnitude faster than the rest. So what is the difference?
All the "slow" functions are written in C++, while the fast ones are written in C (not pure C; I was too lazy to deal with malloc/realloc of output buffers when the output grows in size). I guess it is clear that sometimes there is no choice but to resort to plain C. Personally, I am against using C, for security reasons and for its lack of type safety; in addition, it simply requires more expertise and attention to write high-quality C code.
I will not mark this as the answer for a while; I am waiting for comments on this conclusion.
I want to thank all those who actively participated in the discussion, raised ideas and pointed out inconsistencies in my example.
2019 update:
Just to preserve the code: https://github.com/kreuzerkrieg/string_search_replace
Nonius results: https://kreuzerkrieg.github.io/string_search_replace/
Ran with gcc-9 on Ubuntu 18.04