So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.
One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.
This is my current regex:
(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)
I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.
Out of curiosity I redid my regex sample based on Boost Xpressive, using statically compiled regexes:
sehe@natty:/tmp$ time ./expressive < bench > /dev/null
real 0m2.146s
user 0m2.110s
sys 0m0.030s
Interestingly, there is no discernible speed difference when using the dynamic regular expression; on the whole, however, the Xpressive version performs roughly 10% better than the Boost Regex version.
What is really nice, IMO, is that switching from Boost Regex to Xpressive was almost just a matter of including xpressive.hpp and changing a few namespaces around. The API (as far as it was being used) is exactly the same.
The relevant code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)
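#include <boost/xpressive/xpressive.hpp>
#include <iostream>
#include <string>

// handle_attr() is not shown here; see the full code linked above.

typedef std::string::const_iterator It;

int main(int argc, const char *argv[])
{
    using namespace boost::xpressive;
#if DYNAMIC
    // Dynamic regex: parsed from a string at run time
    const sregex re = sregex::compile
        ("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
#else
    // Static regex: built from an expression template at compile time.
    // Note: this set excludes '\\' as well as '>', so it is slightly
    // stricter than [^\>] in the dynamic pattern.
    const sregex re = "<img" >> +_s >> -*(~(set = '\\','>')) >>
        "src" >> *_s >> '=' >> *_s
        >> (s1 = as_xpr('"') | '\'') >> (s2 = -*_) >> s1;
#endif
    std::string s;
    smatch what;
    while (std::getline(std::cin, s))
    {
        It f = s.begin(), l = s.end();
        do
        {
            if (!regex_search(f, l, what, re))
                break;
            // what[1] is the opening quote, what[2] the attribute value
            handle_attr("img", "src", what[2]);
            f = what[0].second; // resume right after the previous match
        } while (f!=s.end());
    }
    return 0;
}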
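If you want to try the search-and-advance loop without Boost installed, here is a minimal sketch of the same idea using std::regex as a stand-in. It is not Xpressive, and I have not benchmarked it, so the timings above say nothing about it. The function name extract_img_srcs is made up for illustration, and the case-insensitive flag is my assumption, added to mirror the (?i: in the question's regex.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original code): returns the value of
// every src attribute found on an <img> tag in `html`.
std::vector<std::string> extract_img_srcs(const std::string& html)
{
    // Same shape as the DYNAMIC pattern above, matched case-insensitively.
    // Group 1 is the opening quote, group 2 is the attribute value.
    static const std::regex re(
        "<img\\s+[^>]*?src\\s*=\\s*([\"'])(.*?)\\1",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> result;
    std::string::const_iterator f = html.begin(), l = html.end();
    std::smatch what;
    while (std::regex_search(f, l, what, re))
    {
        result.push_back(what[2].str());
        f = what[0].second; // resume right after the previous match
    }
    return result;
}
```

Calling extract_img_srcs on a line of HTML gives you the src values in document order; in the loop above you would pass each value to handle_attr instead of collecting them in a vector.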