Mixed results with perl regex, matching list of phrases in html code


I first asked this as a reply in another thread, "Perl Regex match lines that contain multiple words", but a moderator deleted that reply for reasons unknown to me. Asking there seemed logical, because the problem comes from trying to use the solution given early in that thread. The only explanation was a generic reference to the FAQ, which didn't reveal anything I had done wrong, and the message, "If you have a question, please post your own question." Hence this post.

I am using LWP::Simple to get a web page and then trying to match lines that contain certain phrases. I copied the regex in answer #1 in the above-mentioned thread, and replaced/added words that I need to match, but I am getting mixed results with two similar but different web pages.

The regex I am using is:

/^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim
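For reference, each `(?=.*?\bPHRASE\b)` is a zero-width lookahead tried at the start of a line, so the pattern should succeed only when all five phrases occur somewhere on that same line, in any order. Here is a minimal sketch of that behaviour on single-line strings. The helper name `has_all_phrases` and the shortened sample strings are illustrative only and are not taken from the other thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a single line contains all five phrases, else 0.
sub has_all_phrases {
    my ($line) = @_;
    return $line =~ /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/i ? 1 : 0;
}

# Bare header line: all five phrases present.
print has_all_phrases(" Year   New Moon   First Quarter   Full Moon   Last Quarter"), "\n";

# Same phrases wrapped in tags: the lookaheads skip over the markup.
print has_all_phrases('<br><span class="prehead"> Year   New Moon   First Quarter   Full Moon   Last Quarter</span><br>'), "\n";

# Order does not matter, only presence on the same line.
print has_all_phrases("Last Quarter, Full Moon, First Quarter, New Moon, Year"), "\n";

# A line missing some of the phrases is rejected.
print has_all_phrases(" Year   New Moon   First Quarter"), "\n";
```

On a single line, surrounding tags do not stop the pattern from matching, so the tags alone would not obviously explain a difference between the two pages.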

For web site #1, which has bare lines containing these words, in a series of blocks surrounded by <pre>..</pre> tags, it matches all lines exactly equal to this one, as expected:

 Year        New Moon       First Quarter       Full Moon       Last Quarter

BUT for web site #2, which has nasty little tags surrounding the words:

<br><br><span class="prehead"> Year      New Moon       First Quarter       Full Moon       Last Quarter          &#916;T</span><br>

it matches EVERY line!

I'm sure the <span> tags are the "proper" way to do this but I am wondering how to get around those tags so I can have just one regex for both sites. Is there a simple way to do this or do I have to learn how to parse html (something I'd rather not have to do)?

I'm looking for a quick solution, not a robust one. This is probably a one-time-only deal. If these relatively static pages change, it will probably be minor and easy to fix. Please don't refer me to all the 'anti-regex-for-html' pages. I've seen 'em. And please don't make me use HTML::TreeBuilder. Oh please...


Solution

  • I finally got this working for both URLs, still using the original regex, by splitting the retrieved HTML document into lines and testing each line in a loop:

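    # Split on any line ending (\R matches \n, \r\n, or a lone \r; needs Perl 5.10+)
    for my $line (split qr/\R/, $doc)
    {
        # original regex, now tested against one line at a time
        next unless $line =~ /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim;
        print "$line\n";
    }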
    

    It really shouldn't be this difficult. ;-)
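    Why the per-line loop helps is not spelled out above. One plausible explanation, which is an assumption and not something confirmed in this thread, is that the second page uses line endings other than a bare `\n` (for example a lone `\r`). With `/m`, `^` and `$` recognise only `\n` as a line boundary, and `.` matches `\r`, so the whole document can behave like one long line. Splitting on `\R` treats `\n`, `\r\n`, and a lone `\r` alike. A small sketch with made-up data (the `$doc` string and the `matching_lines` helper are hypothetical):

```perl
use strict;
use warnings;

# The regex from the question, compiled once with /i and /m.
my $re = qr/^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/im;

# Hypothetical helper: return the lines of a document that contain all five phrases.
sub matching_lines {
    my ($text) = @_;
    return grep { $_ =~ $re } split /\R/, $text;
}

# Made-up document that uses lone carriage returns as line endings (an assumption).
my $doc = "<html>\r Year  New Moon  First Quarter  Full Moon  Last Quarter\r 2001  Jan 24\r";

# Applied to the whole string, the pattern sees no \n, so one match can span everything.
my @whole = $doc =~ /$re/g;
print scalar(@whole), " match(es) on the whole document\n";
print "the match includes unrelated lines\n" if $whole[0] =~ /<html>/;

# Applied per line after splitting on \R, only the header line is kept.
my @lines = matching_lines($doc);
print scalar(@lines), " matching line(s) after split\n";
```

    If the real page uses `\n` after all, this sketch does not explain the difference, and the cause would have to be looked for elsewhere.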