I want to crawl the NCBI website and send a request for protein local alignment, available at this link: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch
I would like to know whether I can submit a POST request to this address using PHP and get the results, which come back in a new page. There is also an issue: before the final results are shown, the page undergoes multiple redirects. You can test this using the following input, which goes into the text area:
MHSSIVLATVLFVAIASASKTRELCMKSLEHAKVGTSKEAKQDGIDLYKHMFEHYPAMKKYFKHRENYTP
ADVQKDPFFIKQGQNILLACHVLCATYDDRETFDAYVGELMARHERDHVKVPNDVWNHFWEHFIEFLGSK
TTLDEPTKHAWQEIGKEFSHEISHHGRHSVRDHCMNSLEYIAIGDKEHQKQNGIDLYKHMFEHYPHMRKA
FKGRENFTKEDVQKDAFFVNKDTRFCWPFVCCDSSYDDEPTFDYFVDALMDRHIKDDIHLPQEQWHEFWK
LFAEYLNEKSHQHLTEAEKHAWSTIGEDFAHEADKHAKAEKDHHEGEHKEEHH
Here is my attempt:
$link = 'http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch';
$request = array(
    'http' => array(
        'method' => 'POST',
        'content' => http_build_query(array(
            'QUERY' => $aaText
        )),
    )
);
$context = stream_context_create($request);
$html = file_get_html($link, false, $context);
echo $html;
This code gets me the initial page, as if no POST has been done. Thanks
UPDATE
I have tried one of the suggestions below - Goutte.
Here is my new code:
require_once 'goutte.phar';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', $link);
$form = $crawler->selectButton('b1')->form();
$crawler = $client->submit($form, array('QUERY' => $aaTest));
echo $crawler->html();
The variable $aaTest is the protein sequence I gave above. The good part is that it posts and gets me the new page, but it does not follow all the redirects. How can I make it follow all of them?
I should think this site is very crawlable. To understand what is going on, turn off JavaScript in your browser and try to browse the site (to do this, I use the Disable->Disable JavaScript menu in Firebug, which is a Firefox plugin).
If you go to your first link, and paste in your string, you get a form in a POST operation that effectively says your search is in progress. It will look something like this:
Job Title: Protein Sequence (333 letters)
Request ID: NR8ZP8E1071
Since there is not much of interest on this screen, I am assuming that you do not want to scrape from here - but that is effectively what you are currently doing.
What happens next is that a piece of JavaScript submits a hidden form, using this code:
<SCRIPT LANGUAGE="JavaScript">
setTimeout('document.forms[0].submit();',1000);
</SCRIPT>
My guess is that at times of heavy load, the delay here (presently set to 1000ms i.e. 1 second) would increase a bit. The hidden form looks like this:
<form action="Blast.cgi" enctype="application/x-www-form-urlencoded" method="post" name="RequestFormat" id="RequestFormat">
<input name="CMD" value="Get" type="hidden">
<input name="ALIGNMENTS" value="100" type="hidden">
<input name="ALIGNMENT_VIEW" value="Pairwise" type="hidden">
<input name="BLAST_PROGRAMS" value="blastp" type="hidden">
<input name="CDD_RID" value="data_cache_seq:180192" type="hidden">
<input name="CDD_SEARCH" value="on" type="hidden">
<input name="CDD_SEARCH_STATE" value="4" type="hidden">
<input name="CLIENT" value="web" type="hidden">
<input name="COMPOSITION_BASED_STATISTICS" value="2" type="hidden">
<input name="CONFIG_DESCR" value="2,3,4,5,6,7,8" type="hidden">
<input name="DATABASE" value="nr" type="hidden">
<input name="DESCRIPTIONS" value="100" type="hidden">
<input name="EQ_OP" value="AND" type="hidden">
<input name="EXPECT" value="10" type="hidden">
<input name="FILTER" value="F" type="hidden">
<input name="FORMAT_NUM_ORG" value="1" type="hidden">
<input name="FORMAT_OBJECT" value="Alignment" type="hidden">
<input name="FORMAT_TYPE" value="HTML" type="hidden">
<input name="FULL_DBNAME" value="nr" type="hidden">
<input name="GAPCOSTS" value="11 1" type="hidden">
<input name="GET_SEQUENCE" value="on" type="hidden">
<input name="HSP_RANGE_MAX" value="0" type="hidden">
<input name="JOB_TITLE" value="Protein Sequence (333 letters)" type="hidden">
<input name="LAYOUT" value="OneWindow" type="hidden">
<input name="LINE_LENGTH" value="60" type="hidden">
<input name="MASK_CHAR" value="2" type="hidden">
<input name="MASK_COLOR" value="1" type="hidden">
<input name="MATRIX_NAME" value="BLOSUM62" type="hidden">
<input name="MAX_NUM_SEQ" value="100" type="hidden">
<input name="MYNCBI_USER" value="9311188414" type="hidden">
<input name="NEW_VIEW" value="on" type="hidden">
<input name="NUM_DIFFS" value="0" type="hidden">
<input name="NUM_OPTS_DIFFS" value="0" type="hidden">
<input name="NUM_ORG" value="1" type="hidden">
<input name="NUM_OVERVIEW" value="100" type="hidden">
<input name="OLD_BLAST" value="false" type="hidden">
<input name="OLD_VIEW" value="false" type="hidden">
<input name="PAGE" value="Proteins" type="hidden">
<input name="PAGE_TYPE" value="BlastSearch" type="hidden">
<input name="PROGRAM" value="blastp" type="hidden">
<input name="QUERY_INDEX" value="0" type="hidden">
<input name="QUERY_INFO" value="Protein Sequence (333 letters)" type="hidden">
<input name="QUERY_LENGTH" value="333" type="hidden">
<input name="REPEATS" value="5755" type="hidden">
<input name="RID" value="NR8ZP8E1071" type="hidden">
<input name="RTOE" value="21" type="hidden">
<input name="SELECTED_PROG_TYPE" value="blastp" type="hidden">
<input name="SERVICE" value="plain" type="hidden">
<input name="SHORT_QUERY_ADJUST" value="on" type="hidden">
<input name="SHOW_LINKOUT" value="on" type="hidden">
<input name="SHOW_OVERVIEW" value="on" type="hidden">
<input name="USER_DEFAULT_MATRIX" value="4" type="hidden">
<input name="USER_DEFAULT_PROG_TYPE" value="blastp" type="hidden">
<input name="USER_TYPE" value="2" type="hidden">
<input name="WORD_SIZE" value="3" type="hidden">
<input name="db" value="protein" type="hidden">
<input name="stype" value="protein" type="hidden">
<input name="x" value="41" type="hidden">
<input name="y" value="12" type="hidden">
</form>
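Since a plain PHP client will not run that script, one option is to read the hidden fields out of the page yourself. Here is a minimal sketch using PHP's built-in DOMDocument; the function name extractFormFields is my own invention, and it assumes the form you want is the first one on the page (mirroring document.forms[0] in the script above):

```php
// Hypothetical helper: collect name => value pairs from the inputs of the
// first <form> in an HTML page, so they can be re-posted without JavaScript.
function extractFormFields(string $html): array
{
    // Real-world HTML is rarely well-formed, so collect parser warnings
    // internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Assumption: the interesting form is the first one, as in forms[0].
    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return [];
    }

    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}
```

Given the "search in progress" page as a string, this should return an array with keys such as CMD, RID and FORMAT_TYPE. I have not checked whether the real page contains other forms ahead of this one, so verify that before relying on it.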
Submitting this form also creates a POST request to the program. Of most interest is the field RID, which links the request with your initial query parameters. These are probably stored in a database or temporary file and assigned an ID, which expires in a matter of hours.
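To mimic that second submission from PHP, you could post the hidden fields (including RID) back to Blast.cgi yourself. The sketch below sticks with the stream-context approach from the question; buildPostOptions and fetchResultsPage are hypothetical names, and the 30-second timeout is an arbitrary choice. Note the explicit Content-Type header, which the original attempt did not set:

```php
// Build stream-context options for a form-encoded POST request.
function buildPostOptions(array $fields): array
{
    $body = http_build_query($fields, '', '&');
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . 'Content-Length: ' . strlen($body) . "\r\n",
            'content' => $body,
            'timeout' => 30, // seconds; an arbitrary value for this sketch
        ],
    ];
}

// Hypothetical wrapper: re-submit the hidden form's fields (including RID).
// Returns the response body as a string, or false on failure.
function fetchResultsPage(string $url, array $fields)
{
    $context = stream_context_create(buildPostOptions($fields));
    return file_get_contents($url, false, $context);
}
```

I am assuming here that the results may not be ready on the first attempt, in which case you would wait and post the same fields again. I have not confirmed how the site signals that a search is still running.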
When this form is submitted, lots of interesting information is returned in the response to that POST request. It is possible that one of the above fields specifies the initial number of alignments to show. If you then turn JavaScript back on, you'll find that reaching the end of the page (which is already several screenfuls long) will load another chunk using this program:
Interestingly, a GET request is used here. Using the network monitor in Firefox, I triggered a series of these to see if I could spot a sequence of incrementing numbers. I spotted that SEQ_LIST_START starts at 1 and increments in blocks of 5, but I am not sure where the elements in ALIGN_SEQ_LIST come from - maybe from the current page. It's worth having a look yourself to see if you can spot anything - especially since you will understand the subject matter in a way that I do not.
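If the pattern I observed holds, the SEQ_LIST_START values can be generated rather than hard-coded. This is only a sketch: the function names are made up, the step of 5 comes from what I saw in the network monitor, and the other parameters the GET request needs (ALIGN_SEQ_LIST in particular) are not filled in because I do not know where they come from:

```php
// Generate the SEQ_LIST_START values for a number of chunks: 1, 6, 11, ...
// The defaults reflect what I observed, not a documented rule.
function seqListStartValues(int $chunks, int $first = 1, int $step = 5): array
{
    $values = [];
    for ($i = 0; $i < $chunks; $i++) {
        $values[] = $first + $i * $step;
    }
    return $values;
}

// Merge a SEQ_LIST_START value into whatever base parameters you have
// captured from a real request, and return the encoded query string.
function buildChunkQuery(array $baseParams, int $start): string
{
    $params = array_merge($baseParams, ['SEQ_LIST_START' => $start]);
    return http_build_query($params, '', '&');
}
```

For example, seqListStartValues(3) gives the starting points for the first three chunks, and buildChunkQuery(array('RID' => $rid), 6) produces the query string for the second one, where $rid is the request ID from the hidden form.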
You may be able to tinker around with some of the query string parameters in this link to see what controls the number of items returned. However, be careful: if you request a much larger set than their systems are used to, you may be noticed and have a block placed on your IP address.
Further to that, remember that if you crawl a website, you are passing your costs onto a third party. Since the data appears to be available for free, this will be acceptable to them to some degree, and is the benefit of the funding they have already spent. However, be mindful of the load you are placing on their server: don't request chunks that are excessively large, and put a delay of a few seconds between each request.
If you plan to grab an enormous chunk of data (say more than half a gigabyte), then alternate between a few seconds and a couple of minutes waiting, or perhaps concentrate your downloading during the night (their time) when their servers might be less busy. Failure to "act responsibly" as a crawler may place your IP range on their blocklists, and in the worst cases could constitute a denial of service attack.
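The waiting can be built into the crawl loop itself. Below is a rough sketch of one way to do it; the function names and the specific numbers (5 seconds normally, 120 seconds every tenth request) are my own choices rather than anything the site recommends, so adjust them to taste:

```php
// Seconds to wait before request number $index (0-based).
// Hypothetical policy: short pauses, with a long pause every $longEvery requests.
function delayForRequest(int $index, int $short = 5, int $long = 120, int $longEvery = 10): int
{
    if ($index === 0) {
        return 0; // nothing to wait for before the first request
    }
    return ($index % $longEvery === 0) ? $long : $short;
}

// Fetch each URL in turn, pausing between requests.
// $fetch and $sleep are passed in so the loop can be tried without network access.
function crawlPolitely(array $urls, callable $fetch, ?callable $sleep = null, int $longEvery = 10): array
{
    $sleep = $sleep ?? 'sleep'; // default to PHP's built-in sleep()
    $pages = [];
    foreach (array_values($urls) as $i => $url) {
        $delay = delayForRequest($i, 5, 120, $longEvery);
        if ($delay > 0) {
            $sleep($delay);
        }
        $pages[] = $fetch($url);
    }
    return $pages;
}
```

A call such as crawlPolitely($urls, 'file_get_contents') would then fetch the list with pauses in between. Passing your own $sleep callback lets you check the schedule without actually waiting.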
So, to summarise, here's what you need to do:
Be willing to tinker with your POST and GET parameters to see the effect, and have fun!