This Github repository added std::regex to the list of regular expression engines and got decimated by the others.
Why is that std::regex - as implemented in libstdc++ - so much slower than others? Is that because of the C++ standard requirements or it is just that that particular implementation is not very well optimized?
Also in the shootout std::regex was unable to compile several regular expressions that all the others accepted, even after adding the flag std::regex::extended. They were (?i)Twain
, \b\w+nn\b
, (?i)Tom|Sawyer|Huckleberry|Finn
, \s[a-zA-Z]{0,12}ing\s
, ([A-Za-z]awyer|[A-Za-z]inn)\s
and \p{Sm}
.
UPDATE: Added comparison with boost::regex.
UPDATE2: added ctre
Is that because of the C++ standard requirements or it is just that that particular implementation is not very well optimized?
The answer is yes. Kinda.
There is no question that libstdc++'s implementation of <regex>
is not well optimized. But there is more to it than that. It's not that the standard requirements inhibit optimizations so much as the standard requirements inhibit changes.
The regex library is defined through a bunch of templates. This allows people to choose between char
and wchar_t
, which is in theory good. But there's a catch.
Template libraries are used by copy-and-pasting the code directly into the code compiling against those libraries. Because of how templates get included, even types that nobody outside of the template library knows about are effectively part of the library's ABI. If you change them, two libraries compiled against different versions of the standard library cannot work with each other. And because the template parameter for regex is its character type, those implementation details touch basically everything about the implementation.
The minute libstdc++ (and other standard library implementations) started shipping an implementation of C++ regular expressions, they bound themselves to a specific implementation that could not be changed in a way that impacted the ABI of the library. And while they could cause another ABI break to fix it, standard library implementers don't like breaking ABI because people don't upgrade to standard libraries that break their code.
When C++11 forbade basic_string
copy-on-write implementations, libstdc++ had an ABI problem. Their COW string was widely used, and changing it would make code that compiled against the new one break when used with code compiled against the old one. It took years before libstdc++ bit the bullet and actually implemented C++11 strings.
If Regex had been defined without templates, implementations could use traditional mechanisms to hide implementation details. The ABI for the interface to external code could be fixed and unchanging, with only the implementation of the functions behind that ABI changing from version to version.