
How to Parse this HTML with Web::Scraper?

I am trying to use Web::Scraper to parse the following HTML:



I would like to get the following structure:

      'test' => [
                  {
                    'name' => 'TITLE1',
                    'desc' => 'DESCRIPTION1 '
                  },
                  {
                    'name' => 'TITLE2',
                    'desc' => 'DESCRIPTION2 '
                  },
                  {
                    'name' => 'TITLE3',
                    'desc' => 'DESCRIPTION3 '
                  }
                ]

I have the following code, but I am not having much luck. Using 'TEXT' when processing 'p' returns both the description text and the text inside "strong", for example:

      'test' => [
                  {
                    'name' => 'TITLE1',
                    'desc' => 'TITLE1 DESCRIPTION1 '
                  }
                ]

Plus, it returns only the first item.

Here is my code.

use strict;
use Web::Scraper;
use Data::Dumper;

my $html = q[<div>

my $test = scraper {
    process 'div', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process 'p', 'desc' => 'TEXT';
    };
};

my $res = $test->scrape(\$html);
print Dumper($res);

Thank you.


  • There are two points in your code that need changing.

    To get only the DESCRIPTION text, use XPath. //p/text() selects the text nodes directly under a p, so the text inside the strong is not included.

    To make every p show up in the array, and not only the first one, run the outer process on div p instead of div. The outer scraper then matches each p inside the div, so test[] gets one entry per p rather than a single entry for the one div.

    my $test = scraper {
        process 'div p', 'test[]' => scraper {
            process 'p strong',   'name' => 'TEXT';
            # The filter trims leading and trailing whitespace from the text node.
            process '//p/text()', 'desc' => ['TEXT', sub { s/^\s+|\s+$//g } ];
        };
    };

    Output (with Data::Printer):

    \ {
        test   [
            [0] {
                desc   "DESCRIPTION1",
                name   "TITLE1"
            },
            [1] {
                desc   "DESCRIPTION2",
                name   "TITLE2"
            },
            [2] {
                desc   "DESCRIPTION3",
                name   "TITLE3"
            }
        ]
    }
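
For reference, the two changes can be combined into one self-contained script. This is only a sketch: the HTML from the question did not survive, so the markup in $html below is an assumption reconstructed from the expected output (one div holding one p per item, each p containing a strong title followed by plain text). If the real markup differs, the selectors may need adjusting.

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# ASSUMPTION: hypothetical markup, guessed from the expected output.
# The original HTML is not available.
my $html = q[<div>
<p><strong>TITLE1</strong> DESCRIPTION1 </p>
<p><strong>TITLE2</strong> DESCRIPTION2 </p>
<p><strong>TITLE3</strong> DESCRIPTION3 </p>
</div>];

my $test = scraper {
    # One 'test' entry per <p> inside the <div>.
    process 'div p', 'test[]' => scraper {
        # Title: the text of the <strong> element.
        process 'p strong', 'name' => 'TEXT';
        # Description: the first text node directly under <p>,
        # with leading and trailing whitespace trimmed by the filter.
        process '//p/text()', 'desc' => [ 'TEXT', sub { s/^\s+|\s+$//g } ];
    };
};

my $res = $test->scrape(\$html);

# Sort hash keys so the dump is stable between runs.
$Data::Dumper::Sortkeys = 1;
print Dumper($res);
```

Because 'desc' is a plain key rather than 'desc[]', only the first matching text node is kept. If a p had text on both sides of the strong, 'desc[]' or a more specific XPath would probably be needed.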