Unable to parse HTML tags with Perl


I am trying to parse the following page using Perl:

http://www.inc.com/profile/fuhu

I am trying to extract information such as Rank, 2013 Revenue, and 2010 Revenue. But when I fetch the page with Perl, I get the following, which is also what the browser's page source shows:

 <dl class="RankTable">
<div class="dtddwrapper">
  <div class="dtdd">
    <dt>Rank</dt><dd><%=rank%></dd>
  </div>
</div>
<div class="dtddwrapper">

When I inspect the page with Firebug, however, I get the following:

<dl class="RankTable">
<div class="dtddwrapper">
  <div class="dtdd">
    <dt>Rank</dt><dd>1</dd>
  </div>
</div>
<div class="dtddwrapper">

My Perl code is as follows:

use strict;
use warnings;

use WWW::Mechanize;

my $url  = "http://www.inc.com/profile/fuhu";
my $mech = WWW::Mechanize->new();

$mech->get( $url );

my $data = $mech->content();
print $data;

Solution

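  • As others have said, this is not plain HTML; there is some JS wizardry. The <%=rank%> placeholder in the page source is filled in by JavaScript in the browser, which is why Firebug shows the value but the fetched source does not. The data itself comes from a separate, dynamic JSON request.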

    The following script prints the rank and dumps everything else available in $data. It first extracts the profile ID from the page, then makes the corresponding JSON request, just as a regular browser does.

    use strict;
    use warnings;

    use WWW::Mechanize;
    use JSON qw/decode_json/;
    use Data::Dumper;

    my $url  = "http://www.inc.com/profile/fuhu";
    my $mech = WWW::Mechanize->new();

    # Fetch the profile page; its source contains the numeric profile ID.
    $mech->get( $url );

    if ($mech->content() =~ /profileID = (\d+)/) {
        my $id = $1;

        # Request the JSON resource that holds the actual profile data.
        $mech->get("http://www.inc.com/rest/inc5000company/$id/full_list");
        my $data = decode_json($mech->content());
        my $rank = $data->{data}{rank};

        print "rank is $rank\n";
        print "\ndata hash value \n";
        print Dumper($data);
    }
    else {
        die "Could not find profileID in $url\n";
    }
    

    Output:

    rank is 1
    
    data hash value 
    $VAR1 = {
              'time' => '2014-08-22 11:40:00',
              'data' => {
                          'ifi_industry' => 'Consumer Products & Services',
                          'app_revenues_lastyear' => '195640000',
                          'industry_rank' => '1',
                          'ifc_company' => 'Fuhu',
                          'current_industry_rank' => '1',
                          'app_employ_fouryearsago' => '49',
                          'ifc_founded' => '2008-00-00',
                          'rank' => '1',
                          'city_display_name' => 'Los Angeles',
                          'metro_rank' => '1',
                          'ifc_business_model' => 'The creator of an Android tablet for kids and an Adobe Air application that allows children to access the Internet in a parent-controlled environment.',
                          'next_id' => '25747',
                          'industry_id' => '4',
                          'metro_id' => '2',
                          'app_employ_lastyear' => '227',
                          'state_rank' => '1',
                          'ifc_filelocation' => 'fuhu',
                          'ifc_url' => 'http://www.fuhu.com',
                          'years' => [
                                       {
                                         'ify_rank' => '1',
                                         'ify_metro_rank' => '1',
                                         'ify_industry_rank' => '1',
                                         'ify_year' => '2014',
                                         'ify_state_rank' => '1'
                                       },
                                       {
                                         'ify_industry_rank' => undef,
                                         'ify_year' => '2013',
                                         'ify_rank' => '1',
                                         'ify_metro_rank' => undef,
                                         'ify_state_rank' => undef
                                       }
                                     ],
                          'ifc_twitter_handle' => 'NabiTablet',
                          'id' => '22890',
                          'app_revenues_fouryearsago' => '123000',
                          'ifc_city' => 'El Segundo',
                          'ifc_state' => 'CA'
                        }
            };
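
Since the question asks for the revenue figures as well as the rank, here is a minimal sketch of pulling individual fields out of the decoded structure instead of dumping all of it. It runs offline against a trimmed copy of the response shown above, and it uses the core module JSON::PP, which exports the same decode_json as JSON. The commify helper is hypothetical and only there for readability. Treating app_revenues_lastyear and app_revenues_fouryearsago as the 2013 and 2010 revenue is my assumption based on the key names, not something the response states.

```perl
use strict;
use warnings;
use JSON::PP qw/decode_json/;

# Trimmed copy of the JSON response shown above, embedded so the
# sketch runs without a network request.
my $json = q{
{
  "time": "2014-08-22 11:40:00",
  "data": {
    "ifc_company": "Fuhu",
    "rank": "1",
    "app_revenues_lastyear": "195640000",
    "app_revenues_fouryearsago": "123000",
    "years": [
      { "ify_year": "2014", "ify_rank": "1" },
      { "ify_year": "2013", "ify_rank": "1" }
    ]
  }
}
};

# Hypothetical helper: insert thousands separators into an integer string.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^(\d+)(\d{3})/$1,$2/;
    return $n;
}

# Everything of interest lives under the top-level "data" key.
my $company = decode_json($json)->{data};

my @report = (
    "$company->{ifc_company}: rank $company->{rank}",
    "Revenue last year: " . commify($company->{app_revenues_lastyear}),
    "Revenue four years ago: " . commify($company->{app_revenues_fouryearsago}),
);

# "years" is an array of hashes, one per list year; some values may be null.
for my $year (@{ $company->{years} }) {
    my $rank = defined $year->{ify_rank} ? $year->{ify_rank} : 'n/a';
    push @report, "$year->{ify_year} list: rank $rank";
}

print "$_\n" for @report;
```

Output:

    Fuhu: rank 1
    Revenue last year: 195,640,000
    Revenue four years ago: 123,000
    2014 list: rank 1
    2013 list: rank 1

In the live script, the same hash accesses work on the $data returned by decode_json($mech->content()).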