
Reading a webpage with perl's LWP - output differs from a downloaded html page


I am trying to fetch and process various NCBI pages, such as
http://www.ncbi.nlm.nih.gov/nuccore/NM_000036. However, when I use the 'get' function from Perl's LWP::Simple, the output differs from what I get when I save the page manually (with Firefox's 'save as html' option). The output of 'get' lacks the data I need.

Am I doing something wrong? Should I use another tool?

My script is:

use strict;
use warnings;
use LWP::Simple;


my $input_name='GENES.txt';

open (INPUT, $input_name ) || die "unable to open $input_name";
open (OUTPUT,'>', 'Selected_Genes')|| die;

my $line;


while ($line = <INPUT>)
{

    chomp $line;
    print OUTPUT '>'.$line."\n";
    my $URL='http://www.ncbi.nlm.nih.gov/nuccore/'.$line;
#e.g:
#$URL=http://www.ncbi.nlm.nih.gov/nuccore/NM_000036

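    my $text=get($URL);
    if (!defined $text) {
        warn "could not fetch $URL\n";
        next;
    }
    print $text."\n";
    # capture the /translation="..." qualifier, which may span several lines
    if ($text=~m!\r?\n\r?\s+\/translation="((?:(?:[^"])\r?\n?\r?)*)"!) {
        print OUTPUT $1."\n";
    }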

}

Thanks in advance!


Solution

  • The content you are looking for is generated by JavaScript, so it is not in the HTML that 'get' returns. You need to parse the HTML from the first response and find the ID of the record you want:

    <meta name="ncbi_uidlist" content="289547499" />
    

    Next, make a second request to a URL of the form: http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=ID_YOU_HAVE

    Something like this (untested!):

    my $URL='http://www.ncbi.nlm.nih.gov/nuccore/'.$line;
    my $html=get($URL);
    
    # the ID is in the <meta name="ncbi_uidlist" ...> tag of the first response
    my ($id) = $html =~m{name="ncbi_uidlist" \s+ content="([^"]+)"}xi;
    if ($id) {
        # second request: fetch the record itself and search it for the translation
        $html=get( "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=" . $id );
        if ($html=~m!\r?\n\r?\s+\/translation="((?:(?:[^"])\r?\n?\r?)*)"!) {
            print OUTPUT $1."\n";
        }
    }
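
The two extraction steps above can also be tried offline, without touching the network. The following is only a minimal sketch: the helper names extract_uid and extract_translation are hypothetical (they are not part of LWP), the sample strings are made up to resemble the fragments quoted above, and it assumes the second response contains the /translation="..." qualifier as plain text.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the UID out of the
# <meta name="ncbi_uidlist" content="..."> tag of the first response.
sub extract_uid {
    my ($html) = @_;
    return unless defined $html;
    my ($id) = $html =~ m{name="ncbi_uidlist" \s+ content="([^"]+)"}xi;
    return $id;
}

# Hypothetical helper: pull the /translation="..." qualifier out of the
# record text and join its wrapped lines into one sequence string.
sub extract_translation {
    my ($text) = @_;
    return unless defined $text;
    my ($seq) = $text =~ m{/translation="([^"]*)"};
    return unless defined $seq;
    $seq =~ s/\s+//g;    # drop line breaks and indentation inside the value
    return $seq;
}

# Made-up sample inputs, shaped like the fragments quoted in the answer.
my $sample_html = '<meta name="ncbi_uidlist" content="289547499" />';
my $sample_text = qq{     CDS   1..9\n     /translation="MKV\n     LLA"\n};

print extract_uid($sample_html), "\n";          # prints 289547499
print extract_translation($sample_text), "\n";  # prints MKVLLA
```

In the real script, the string passed to extract_uid would be the result of get() on the nuccore URL, and the string passed to extract_translation would be the result of get() on the sviewer URL built from that ID. Both helpers return undef when nothing matches, so the caller can skip a gene instead of printing an empty $1.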