I'm trying to extract the comment section from an HTML source string like the one below. My code partly works, but not quite.
<html><body>Login Successful!</body><!-- EXTRACT-THIS --></html>
Here's my code so far:
#include <string>
#include <iostream>
#include <sstream>
#include <fstream>
#include <regex>
using namespace std;
int main()
{
string s =
"<html><body>Login Successful!</body><!-- EXTRACT-THIS --></html>";
// Regular expression to extract from HTML comment
// <!-- comment -->
regex r("[<!--\r\n\t][\r\n\t-->]");
for (sregex_token_iterator it = sregex_token_iterator(
s.begin(),
s.end(),
r,
-1);
it != sregex_token_iterator(); ++it)
{
cout << "TOKEN: " << (string) *it << endl;
}
return 0;
}
My main question: is there a way to improve my regular expression?
Let's start with a std::string that contains more than one comment section:
string s = "<html><body>Login Successful!</body><!-- EXTRACT-THIS --><p>Test</p><!-- XXX --></html>";
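If you want to remove the HTML comments from this string, you can split it on a regular expression that matches a comment. Passing -1 as the last argument to sregex_token_iterator yields the parts between the matches. Note that [^>]* assumes the comment text contains no '>' character: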
regex r("(<\\!--[^>]*-->)");
// split the string using the regular expression
sregex_token_iterator iterator = sregex_token_iterator(s.begin(), s.end(), r, -1);
sregex_token_iterator end;
for (; iterator != end; ++iterator)
{
cout << "TOKEN: " << (string) *iterator << endl;
}
This code prints:
TOKEN: <html><body>Login Successful!</body>
TOKEN: <p>Test</p>
TOKEN: </html>
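If you need the cleaned string in one piece rather than as separate tokens, std::regex_replace can do the removal directly. The following is a minimal sketch, not the only way to do it: the function name stripComments is made up for illustration, and the pattern uses a lazy [\s\S]*? (an assumption that differs from the [^>]* pattern above) so that comments containing '>' or newlines are matched too.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns a copy of `html` with every <!-- ... --> removed.
std::string stripComments(const std::string& html)
{
    // [\s\S]*? lazily matches any characters, including '>' and newlines,
    // so each match ends at the first "-->" after the opening "<!--".
    static const std::regex comment(R"(<!--[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}
```

Calling stripComments(s) on the string above should return <html><body>Login Successful!</body><p>Test</p></html>.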
If you want to extract the comments from the string, you can use std::sregex_iterator like this:
regex r("(<\\!--[^>]*-->)");
std::sregex_iterator next(s.begin(), s.end(), r);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
next++;
}
This code prints:
<!-- EXTRACT-THIS -->
<!-- XXX -->
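If only the text inside the comment is wanted, without the <!-- and --> delimiters, a capture group can pick it out. This is a sketch under a few assumptions: commentTexts is a hypothetical name, surrounding whitespace is trimmed by the \s* parts of the pattern, and the lazy [\s\S]*? is used instead of [^>]* so that a '>' inside a comment does not prevent a match.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the trimmed text of every HTML comment in `html`.
std::vector<std::string> commentTexts(const std::string& html)
{
    // Group 1 captures the comment text; the \s* on each side drops
    // leading and trailing whitespace from the capture.
    static const std::regex comment(R"(<!--\s*([\s\S]*?)\s*-->)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(html.begin(), html.end(), comment), end;
         it != end; ++it)
    {
        out.push_back((*it)[1].str());
    }
    return out;
}
```

For the string above, commentTexts(s) should yield the two entries EXTRACT-THIS and XXX.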
Another option is to search for the opening and closing comment delimiters manually, using std::string::find() and std::string::substr():
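const std::string OPEN_TAG = "<!--";
const std::string CLOSE_TAG = "-->";
auto posOpen = s.find(OPEN_TAG, 0);
while (posOpen != std::string::npos) {
    // Find the closing delimiter that follows this opening delimiter.
    auto posClose = s.find(CLOSE_TAG, posOpen);
    if (posClose == std::string::npos) {
        break; // unterminated comment: stop instead of computing with npos
    }
    std::cout << s.substr(posOpen, posClose - posOpen + CLOSE_TAG.length()) << '\n';
    posOpen = s.find(OPEN_TAG, posClose + CLOSE_TAG.length());
}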
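The same find()/substr() idea can be wrapped in a function that returns the comment bodies instead of printing them. This is only a sketch: commentBodies is a hypothetical name, it returns the text between the delimiters (including any surrounding spaces), it searches for the closing delimiter only after the opening one, and it stops at an unterminated comment rather than guessing where it ends.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the raw text between each "<!--" and "-->".
std::vector<std::string> commentBodies(const std::string& html)
{
    const std::string open = "<!--";
    const std::string close = "-->";
    std::vector<std::string> out;

    std::size_t pos = html.find(open);
    while (pos != std::string::npos) {
        const std::size_t start = pos + open.length();    // first char after "<!--"
        const std::size_t stop = html.find(close, start); // start of "-->"
        if (stop == std::string::npos) {
            break; // unterminated comment: nothing more to extract
        }
        out.push_back(html.substr(start, stop - start));
        pos = html.find(open, stop + close.length());
    }
    return out;
}
```

Compared with the regex versions, this avoids constructing a std::regex, and the handling of malformed input is explicit in the loop.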