I am trying to scrape the HTML of http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all. The result set contains a single result, inside an element with the class result_text. I take the link inside that element and read its text, which, as Firebug shows, is A Man, a Bear. Strangely, the following code prints Yek mard, yek khers instead. How can I get the text that I see in the browser?
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
use URI;
use URI::Escape;
use HTML::TreeBuilder;
my $ua = LWP::UserAgent->new;
my $name = "Yek mard, yek khers";
my $uri = URI->new("http://www.imdb.com/find?q=" . uri_escape($name) . "&s=all");
my $response = $ua->get($uri);
my $root = HTML::TreeBuilder->new_from_content($response->decoded_content);
my @results = $root->find_by_attribute("class", "result_text");
my $link = $results[0]->find_by_tag_name("a");
say $link->as_HTML();
# This should print <a href="/title/tt0122857/?ref_=fn_al_tt_1">A Man, a Bear</a>
# but prints <a href="/title/tt0122857/?ref_=fn_al_tt_1">Yek mard, yek khers</a>
Update
My apologies. After looking further I have found that IMDb uses the Accept-Language header of the HTTP request to determine how to render the page. By default LWP doesn't send this header at all, but Firefox does, which is why my original solution using WWW::Mechanize::Firefox works correctly.
So a solution using only LWP is possible. One way is to build a tailored request using an HTTP::Request object and pass it to an LWP::UserAgent object using the request method.
This code demonstrates.
use strict;
use warnings;
use feature 'say';
use LWP;
use HTML::TreeBuilder::XPath;
my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $url, ['Accept-Language' => 'en-gb,en']);
my $resp = $ua->request($req);
my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');
say $results[0];
The output is the same as for the original solution.
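If the header should accompany every request rather than a single one, a default header can be set on the user agent itself. The following is a minimal sketch, not a tested replacement for the program above: it uses the default_header method of LWP::UserAgent, and the fetch_html helper is a hypothetical name introduced here for illustration. The language value en-gb,en is simply the one used in the request above.

```perl
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;

# Send Accept-Language with every request made through this user agent,
# so that each call to get() carries it without building an HTTP::Request.
my $ua = LWP::UserAgent->new;
$ua->default_header('Accept-Language' => 'en-gb,en');

# Hypothetical helper: fetch a URL and return the decoded HTML,
# dying with the HTTP status line if the request fails.
sub fetch_html {
    my ($url) = @_;
    my $resp = $ua->get($url);
    die 'Request failed: ', $resp->status_line, "\n" unless $resp->is_success;
    return $resp->decoded_content;
}

# Usage (performs a live request, so it is left commented out):
# my $html = fetch_html('http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all');
```

The returned HTML can then be handed to HTML::TreeBuilder::XPath exactly as in the program above. For a one-off request, the get method also accepts header name and value pairs after the URL, as in $ua->get($url, 'Accept-Language' => 'en-gb,en').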
Original Answer
The problem is that the content you are seeing in your browser is generated by JavaScript code after the page has loaded. The simple combination of LWP and HTML::TreeBuilder cannot process anything other than the raw HTML returned by the site.
The usual solution recommended for this is to use the WWW::Mechanize::Firefox module, which uses a live Firefox process to fetch the HTML and JavaScript and render the page. Note that it requires the Firefox browser to be installed on your machine, and the MozRepl Firefox add-on must be installed and running.
This program shows working code that returns the result you expect. Note that I have also used HTML::TreeBuilder::XPath instead of the bare HTML::TreeBuilder, which allows a much simpler expression of the parts of the HTML you are interested in.
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;
my $url = 'http://www.imdb.com/find?q=Yek+mard%2C+yek+khers&s=all';
my $mech = WWW::Mechanize::Firefox->new;
$mech->get($url);
my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->response->content);
my @results = $tree->findnodes_as_strings('//td[@class="result_text"]/a/text()');
say $results[0];
output
A Man, a Bear