
Extract HTML comments using C++ std::sregex_token_iterator


I'm trying to extract the comment section from an HTML source string. My code sort of works, but not quite. Given this input:

<html><body>Login Successful!</body><!-- EXTRACT-THIS --></html>

Here's my code so far:

#include <string>
#include <iostream>
#include <sstream>
#include <fstream>
#include <regex>

using namespace std;

int main()
{
    string s = 
    "<html><body>Login Successful!</body><!-- EXTRACT-THIS --></html>";

    // Regular expression to extract from HTML comment 
    // <!-- comment -->
    regex  r("[<!--\r\n\t][\r\n\t-->]");

    for (sregex_token_iterator it = sregex_token_iterator(
                                        s.begin(), 
                                        s.end(), 
                                        r, 
                                        -1); 
         it != sregex_token_iterator(); ++it)
    {
        cout << "TOKEN: " << (string) *it << endl;
    }

    return 0;
}

My main question: is there a way to improve my regular expression?


Solution

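  • In the question's pattern, the square brackets define character classes, and each class matches a single character. The pattern therefore matches two-character sequences rather than the literal <!-- and --> delimiters. Let's start with a std::string that contains more than one comment section: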

    string s = "<html><body>Login Successful!</body><!-- EXTRACT-THIS --><p>Test</p><!-- XXX --></html>";
    

    Removing the Comments and Printing the HTML tags

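    If you want to drop the HTML comments and keep the surrounding markup, you can split the string on a regular expression that matches a comment. Passing -1 as the last argument makes std::sregex_token_iterator return the text between the matches: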

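    // Matches a whole comment. Note: [^>]* cannot cross a '>', so this
    // assumes the comment body itself contains no '>' character.
    regex r("(<\\!--[^>]*-->)");
    
    // Split the string on the regular expression; -1 selects the non-matching parts
    sregex_token_iterator iterator = sregex_token_iterator(s.begin(), s.end(), r, -1);
    sregex_token_iterator end;
    for (; iterator != end; ++iterator)
    {
        cout << "TOKEN: " << iterator->str() << endl;
    }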
    

    This code prints:

    TOKEN: <html><body>Login Successful!</body>
    TOKEN: <p>Test</p>
    TOKEN: </html>
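
If you would rather get the cleaned markup back as a single string instead of separate tokens, std::regex_replace can do the removal in one call. The following is only a sketch: strip_comments is a hypothetical helper name, not part of the code above, and the pattern uses a lazy [\s\S]*? instead of [^>]* on the assumption that comment bodies may contain '>' or newlines.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the answer above): returns a copy of
// `html` with every <!-- ... --> comment removed.
std::string strip_comments(const std::string& html)
{
    // [\s\S]*? matches any character, including newlines and '>',
    // as few times as possible, so each match ends at the first "-->".
    static const std::regex comment(R"(<!--[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}
```

Calling strip_comments(s) on the string above should leave only the markup, with both comments gone.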
    

    Removing the HTML Tags and Printing the Comments

    If you want to extract the comments from the string, you can use std::sregex_iterator, which visits each match in turn:

    regex r("(<\\!--[^>]*-->)");
    
    std::sregex_iterator next(s.begin(), s.end(), r);
    std::sregex_iterator end;
    while (next != end) {
        std::smatch match = *next;
        std::cout << match.str() << "\n";
        ++next;
    }
    

    This code prints:

    <!-- EXTRACT-THIS -->
    <!-- XXX -->
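
If only the text inside the comment is wanted (EXTRACT-THIS rather than <!-- EXTRACT-THIS -->), a capture group can pick it out. This is a minimal sketch under a few assumptions: comment_texts is a hypothetical helper name, surrounding whitespace is trimmed by the pattern, and the lazy [\s\S]*? is used so that a '>' inside a comment does not end the match early.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the text inside each <!-- ... --> comment,
// without the delimiters and without surrounding whitespace.
std::vector<std::string> comment_texts(const std::string& html)
{
    // \s* on both sides skips padding; ([\s\S]*?) lazily captures the body.
    static const std::regex comment(R"(<!--\s*([\s\S]*?)\s*-->)");

    std::vector<std::string> texts;
    for (std::sregex_iterator it(html.begin(), html.end(), comment), end;
         it != end; ++it)
    {
        texts.push_back((*it)[1].str());  // capture group 1 is the comment body
    }
    return texts;
}
```

For the string s above, this should yield two entries, EXTRACT-THIS and XXX.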
    

    Parsing Comment Tags Manually

    Another option is to locate the opening and closing comment delimiters manually with std::string::find() and cut out each comment with std::string::substr():

    const std::string OPEN_TAG = "<!--";
    const std::string CLOSE_TAG = "-->";
    
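    auto posOpen = s.find(OPEN_TAG, 0);
    while (posOpen != std::string::npos) {
        // Look for the terminator only after the opening delimiter
        auto posClose = s.find(CLOSE_TAG, posOpen + OPEN_TAG.length());
        if (posClose == std::string::npos) {
            break;  // unterminated comment: npos + length would wrap around
        }
        std::cout << s.substr(posOpen, posClose - posOpen + CLOSE_TAG.length()) << '\n';
        posOpen = s.find(OPEN_TAG, posClose + CLOSE_TAG.length());
    }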
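
The manual approach can also be wrapped in a function that returns the comments instead of printing them, which makes it easier to reuse. This is a sketch, not a definitive implementation: find_comments is a hypothetical name, and it assumes that an unterminated comment should simply be ignored.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns each complete <!-- ... --> comment,
// delimiters included, using only std::string::find and substr.
std::vector<std::string> find_comments(const std::string& html)
{
    const std::string open_tag = "<!--";
    const std::string close_tag = "-->";

    std::vector<std::string> comments;
    std::size_t pos_open = html.find(open_tag);
    while (pos_open != std::string::npos) {
        // Search for the terminator only after the opening delimiter.
        const std::size_t pos_close =
            html.find(close_tag, pos_open + open_tag.length());
        if (pos_close == std::string::npos) {
            break;  // unterminated comment: stop rather than wrap around
        }
        const std::size_t pos_end = pos_close + close_tag.length();
        comments.push_back(html.substr(pos_open, pos_end - pos_open));
        pos_open = html.find(open_tag, pos_end);
    }
    return comments;
}
```

Unlike the regex versions, this avoids std::regex entirely, which can be convenient when the input is large or the delimiters are fixed strings.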