
I'm new to Perl and have a few regex questions


I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:

 <dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
 <dd>
   <p>
     [Content]
   </p>
 </dd>
 ... and so on.

and here's the example script I'm studying:

#!/usr/bin/perl -w

use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;

my $rss = XML::RSS->new(version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);

$rss->channel(title       => "The more accurate diary. Really.",
              link        => $url,
              description => "Telsa's diary of life with a hacker:"
                             . " the current ramblings");

foreach (split ('<dt>', $page))
{
    if (/<a\sname="
         ([^"]*)     # Anchor name
         ">
         <strong>
         ([^>]*)     # Post title
         <\/strong><\/a><\/dt>\s*<dd>
         (.*)        # Body of post
         <\/dd>/six)
    {
        $rss->add_item(title       => $2,
                       link        => "$url#$1",
                       description => encode_entities($3));
    }
}

If you have a moment to better help me understand, my questions are:

  1. how does the following line work:

    ([^"]*) # Anchor name

  2. how does the following line work:

    ([^>]*) # Post title

  3. what does the "six" mean in the following line:

    </dd>/six)

Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!


Solution

  • how does the following line work...

    ([^"]*) # Anchor name

    Zero or more characters that aren't ", captured as $1, $2, and so on, depending on how many opening parentheses ( come before this one. Here it is the first group, so it becomes $1.
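To see why a negated class is used here rather than a plain .*, a small sketch may help. The tag string and its class attribute are made up for illustration and do not come from the blog being scraped.

```perl
use strict;
use warnings;

# Hypothetical tag with two quoted attributes (not from the scraped page).
my $tag = '<a name="2004-10-25" class="entry">';

# Greedy .* runs on to the last double quote in the string.
my ($greedy) = $tag =~ /name="(.*)"/;

# [^"]* cannot cross a double quote, so it stops at the end of the value.
my ($bounded) = $tag =~ /name="([^"]*)"/;

print "greedy:  $greedy\n";
print "bounded: $bounded\n";
```

The greedy version swallows the second attribute as well, while the negated class keeps only the anchor name.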

    how does the following line work...

    ([^>]*) # Post title

    Zero or more characters that aren't >, captured the same way. Here it is the second group, so it becomes $2.
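The numbering of the capture groups can be checked with a short sketch. It assumes a chunk shaped like the HTML in the question, after split has removed the leading <dt>; the variable names $chunk, $anchor and $title are only for illustration.

```perl
use strict;
use warnings;

# One chunk in the shape the question describes, with the leading <dt>
# already removed by split('<dt>', ...).
my $chunk = '<a name="2004-10-25"><strong>October 25th</strong></a></dt>';

my ($anchor, $title);
if ($chunk =~ /<a\sname="([^"]*)"><strong>([^>]*)<\/strong>/) {
    # Groups are numbered by the position of their opening parenthesis:
    # the first ( ... ) fills $1, the second fills $2.
    ($anchor, $title) = ($1, $2);
}

print "anchor: $anchor\n";
print "title:  $title\n";
```

Note that [^>]* first runs past the title up to the > of </strong>, then backtracks until the literal </strong> can match, which leaves just the title text in $2.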

    what does the "six" mean in the following line...

    </dd>/six)

    • s = treat the string as a single line: "." matches any character, including \n, which it would not match otherwise
    • i = match case-insensitively
    • x = ignore unescaped whitespace in the regex

    x also makes it possible to put comments into the regex itself, so things like # Post title there are just comments.
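The effect of each flag can be seen in a minimal sketch. The $text string is an invented two-line post body, not real output from the script.

```perl
use strict;
use warnings;

# Invented sample: upper-case tags and a body that spans two lines.
my $text = "<DD>first line\nsecond line</DD>";

# Without /s, "." stops at the newline, so the body cannot span two lines.
my $without_s = $text =~ /<dd>(.*)<\/dd>/i ? 1 : 0;

# With /s, "." also matches "\n"; /i lets <dd> match <DD>.
my $with_s = $text =~ /<dd>(.*)<\/dd>/si ? 1 : 0;

# With /x, whitespace and "#" comments inside the pattern are ignored.
my ($body) = $text =~ /
    <dd>
    (.*)      # Body of post
    <\/dd>
/six;

print "without s: $without_s\n";
print "with s:    $with_s\n";
print "body:      $body\n";
```

Only the patterns that carry the s flag match, because the body contains a newline.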

    See perldoc perlre for more and better information. The linked page covers Perl 5.10; if you run a different version, read the perlre document that ships with your Perl instead.