c++ boost-spirit boost-spirit-qi

Using Boost.Spirit to extract certain tags/attributes from HTML


So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.

One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.

This is my current regex:

(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)

I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.


Solution

  • Out of curiosity I redid my regex sample based on Boost Xpressive, using statically compiled regexes:

    sehe@natty:/tmp$ time ./expressive < bench > /dev/null
    
    real    0m2.146s
    user    0m2.110s
    sys     0m0.030s
    

    Interestingly, there is no discernible speed difference when using the dynamic regular expression; on the whole, however, the Xpressive version performs roughly 10% better than the Boost Regex version.

    What is really nice, IMO, is that switching from Boost Regex to Xpressive was almost just a matter of including xpressive.hpp and changing a few namespaces. The API (as far as it was being used) is exactly the same.

    The relevant code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)

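    #include <boost/xpressive/xpressive.hpp>
    #include <iostream>
    #include <string>
    
    // handle_attr() is defined in the full code linked above.
    typedef std::string::const_iterator It;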
    
    int main(int argc, const char *argv[])
    {
        using namespace boost::xpressive;
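    #if DYNAMIC
        // Dynamic regex: the pattern string is compiled at run time.
        const sregex re = sregex::compile
             ("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
    #else
        // Static regex: built at compile time as an expression template.
        // s1 captures the opening quote, s2 the attribute value.
        const sregex re = "<img" >> +_s >> -*(~(set = '\\','>')) >>
            "src" >> *_s >> '=' >> *_s
            >> (s1 = as_xpr('"') | '\'') >> (s2 = -*_) >> s1;
    #endif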
    
        std::string s;
        smatch what;
    
        while (std::getline(std::cin, s))
        {
            It f = s.begin(), l = s.end();
    
            // Find every match on this line, resuming right after the previous one.
            while (f != l && regex_search(f, l, what, re))
            {
                handle_attr("img", "src", what[2]);
                f = what[0].second;
            }
        }
    
        return 0;
    }
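
For anyone who wants to try the search loop without Boost installed, here is a rough sketch of the same idea using std::regex in place of Xpressive. It is not part of the benchmark above and will not reproduce the timings. The helper name extract_img_srcs is made up for illustration, and the icase flag is an assumption added to mirror the (?i: in the question's regex.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): returns the value of
// every src attribute found in <img ...> tags within `html`.
std::vector<std::string> extract_img_srcs(const std::string& html)
{
    // Same pattern as the DYNAMIC branch above, without the redundant \> escape.
    // Group 1 is the opening quote, group 2 the attribute value.
    // Assumption: icase is added here to mirror the question's (?i: prefix.
    static const std::regex re(
        "<img\\s+[^>]*?src\\s*=\\s*([\"'])(.*?)\\1",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> result;
    std::string::const_iterator f = html.begin(), l = html.end();
    std::smatch what;

    // Search, record the capture, then resume right after the previous match.
    while (f != l && std::regex_search(f, l, what, re))
    {
        result.push_back(what[2].str());
        f = what[0].second;
    }
    return result;
}
```

Calling extract_img_srcs on a line of HTML returns the src values in document order. Like the patterns above, this is a plain text match rather than a real HTML parse, so it only fits input whose shape you already know.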