I'm trying to get the XML presented here http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml but it's a bit tricky because they don't give any support for it. The purpose is to get the XML into PHP in order to go through it.
Can someone give a hint?
The XML on that page is still XML, even though it is presented via HTML.
What you're looking for is the textContent property of the DOM nodes in a DOMDocument. It gives you only the text of that HTML, just as it is displayed "as text" in the browser.
So all you need to do is load the HTML document into a DOMDocument. Because the page contains markup errors, libxml's internal error handling is switched on while loading:
$url = 'http://www.ncbi.nlm.nih.gov/sra/ERX086768?report=FullXml';
$doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$doc->loadHTMLFile($url);
libxml_use_internal_errors(FALSE);
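If you want to see what the parser complained about instead of silently discarding it, the buffered errors can be read back. The following is only a minimal sketch: the `$html` string is a made-up stand-in for the downloaded page, and whether it triggers any errors at all depends on the libxml2 version in use.

```php
<?php
// Hypothetical, slightly sloppy HTML standing in for the real page.
$html = '<html><body><section id="ResultView">x</section>'
      . '<div class="xml-tag">&lt;A/&gt;</div></body></html>';

$doc = new DOMDocument();

// Buffer parser errors instead of emitting PHP warnings; keep the old setting.
$previous = libxml_use_internal_errors(true);
$loaded   = $doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the buffer again
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Restoring the previous setting rather than hard-coding FALSE keeps the snippet from changing error handling for code that runs later.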
The next part requires specific knowledge about the page being scraped. In your case the XML is the text content of all div elements with the class attribute "xml-tag" that *follow*, as siblings, the element with the id "ResultView".
These elements can easily be fetched with an XPath query; their text content is then stored in an array:
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');
$buffer = array();
foreach ($nodes as $node) {
    $buffer[] = $node->textContent;
}
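To see what the query selects without hitting the network, here is a small self-contained sketch. The HTML is a hypothetical miniature of the structure described above, not the real page, and the element names ROOT and CHILD are invented for illustration.

```php
<?php
// Hypothetical miniature of the page: escaped XML spread over several divs.
$html = <<<HTML
<html><body>
<div id="ResultView">header</div>
<div class="xml-tag">&lt;ROOT&gt;</div>
<div class="other">ignored</div>
<div class="xml-tag">&lt;CHILD/&gt;</div>
<div class="xml-tag">&lt;/ROOT&gt;</div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Only div siblings after #ResultView whose class is exactly "xml-tag".
$nodes = $xpath->query('//*[@id="ResultView"]/following-sibling::div[@class="xml-tag"]');

$buffer = array();
foreach ($nodes as $node) {
    // textContent returns the text with entities like &lt; already decoded.
    $buffer[] = $node->textContent;
}

$xml = implode('', $buffer);
echo $xml, "\n"; // prints <ROOT><CHILD/></ROOT>
```

Note that `@class="xml-tag"` is an exact string comparison, so a div carrying additional classes would not match.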
All that is left now is to create a new DOMDocument, load that XML buffer into it, apply some nice formatting and output it:
$new = new DOMDocument();
$new->preserveWhiteSpace = FALSE;
$new->formatOutput = TRUE;
$new->loadXML(implode('', $buffer));
$new->save('php://output');
These roughly 20 lines of code then produce the following output:
<?xml version="1.0"?>
<EXPERIMENT_PACKAGE>
<EXPERIMENT alias="SC_EXP_7229_8#56" center_name="SC" accession="ERX086768">
<IDENTIFIERS>
<PRIMARY_ID>ERX086768</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
<TITLE/>
<STUDY_REF accession="ERP000913" refname="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" refcenter="SC">
<IDENTIFIERS>
<PRIMARY_ID>ERP000913</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
</IDENTIFIERS>
</STUDY_REF>
<DESIGN>
<DESIGN_DESCRIPTION>Standard</DESIGN_DESCRIPTION>
<SAMPLE_DESCRIPTOR accession="ERS074283" refname="MR223754-sc-2011-11-18T11:31:44Z-1306470" refcenter="SC">
<IDENTIFIERS>
<PRIMARY_ID>ERS074283</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
</IDENTIFIERS>
</SAMPLE_DESCRIPTOR>
<LIBRARY_DESCRIPTOR>
<LIBRARY_NAME>4008297</LIBRARY_NAME>
<LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<PAIRED NOMINAL_LENGTH="250"/>
</LIBRARY_LAYOUT>
</LIBRARY_DESCRIPTOR>
<SPOT_DESCRIPTOR>
<SPOT_DECODE_SPEC>
<READ_SPEC>
<READ_INDEX>0</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Forward</READ_TYPE>
<BASE_COORD>1</BASE_COORD>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>1</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Reverse</READ_TYPE>
<RELATIVE_ORDER follows_read_index="0"/>
</READ_SPEC>
</SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>
</DESIGN>
<PLATFORM>
<ILLUMINA>
<INSTRUMENT_MODEL>Illumina HiSeq 2000</INSTRUMENT_MODEL>
</ILLUMINA>
</PLATFORM>
<PROCESSING/>
</EXPERIMENT>
<SUBMISSION accession="ERA119046" center_name="SC" submission_date="2012-04-17T09:29:50Z" alias="ERP000913-sc-20120417-2" lab_name="">
<IDENTIFIERS>
<PRIMARY_ID>ERA119046</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">ERP000913-sc-20120417-2</SUBMITTER_ID>
</IDENTIFIERS>
</SUBMISSION>
<STUDY alias="Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977" center_name="SC" accession="ERP000913">
<IDENTIFIERS>
<PRIMARY_ID>ERP000913</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis-sc-2011-09-22T08:43:17Z-1977</SUBMITTER_ID>
</IDENTIFIERS>
<DESCRIPTOR>
<STUDY_TITLE>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</STUDY_TITLE>
<STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
<STUDY_ABSTRACT>http://www.sanger.ac.uk/resources/downloads/bacteria/</STUDY_ABSTRACT>
<CENTER_PROJECT_NAME>Genome_diversity_in_Streptococcus_dysgalactiae_subspecies_equisimilis</CENTER_PROJECT_NAME>
<STUDY_DESCRIPTION>http://www.sanger.ac.uk/resources/downloads/bacteria/
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/</STUDY_DESCRIPTION>
</DESCRIPTOR>
</STUDY>
<SAMPLE alias="MR223754-sc-2011-11-18T11:31:44Z-1306470" center_name="SC" accession="ERS074283">
<IDENTIFIERS>
<PRIMARY_ID>ERS074283</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">MR223754-sc-2011-11-18T11:31:44Z-1306470</SUBMITTER_ID>
</IDENTIFIERS>
<SAMPLE_NAME>
<COMMON_NAME>Streptococcus dysgalactiae subspecies equisimilis</COMMON_NAME>
<TAXON_ID>119602</TAXON_ID>
<SCIENTIFIC_NAME>Streptococcus dysgalactiae subsp. equisimilis</SCIENTIFIC_NAME>
</SAMPLE_NAME>
<SAMPLE_LINKS>
<SAMPLE_LINK>
<ENTREZ_LINK>
<DB>biosample</DB>
<ID>859730</ID>
</ENTREZ_LINK>
</SAMPLE_LINK>
</SAMPLE_LINKS>
<SAMPLE_ATTRIBUTES>
<SAMPLE_ATTRIBUTE>
<TAG>Strain</TAG>
<VALUE>MR223754</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>Sample Description</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-StrainOrLine</TAG>
<VALUE>MR223754</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-Sex</TAG>
<VALUE>not applicable</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>ArrayExpress-Species</TAG>
<VALUE>Streptococcus dysgalactiae subspecies equisimilis</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<RUN_SET>
<RUN alias="SC_RUN_7229_8#56" center_name="SC" accession="ERR109334" total_spots="2708543" total_bases="406281450" size="334475592" load_done="true" published="2012-04-27 20:11:35" is_public="true" cluster_name="public" static_data_available="1">
<IDENTIFIERS>
<PRIMARY_ID>ERR109334</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_RUN_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
<EXPERIMENT_REF refname="SC_EXP_7229_8#56" refcenter="SC" accession="ERX086768">
<IDENTIFIERS>
<PRIMARY_ID>ERX086768</PRIMARY_ID>
<SUBMITTER_ID namespace="SC">SC_EXP_7229_8#56</SUBMITTER_ID>
</IDENTIFIERS>
</EXPERIMENT_REF>
<Pool>
<Member member_name="" accession="ERS074283" sample_name="MR223754-sc-2011-11-18T11:31:44Z-1306470" spots="2708543" bases="406281450"/>
</Pool>
</RUN>
</RUN_SET>
</EXPERIMENT_PACKAGE>
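Since the goal was to go through the XML in PHP, the same DOMXPath class works on the extracted document too. A minimal sketch, assuming a shortened excerpt of the output above (the shortening and the variable names are mine):

```php
<?php
// Shortened excerpt of the output above, reduced to a few elements.
$xml = '<EXPERIMENT_PACKAGE>'
     . '<EXPERIMENT accession="ERX086768"><IDENTIFIERS>'
     . '<PRIMARY_ID>ERX086768</PRIMARY_ID></IDENTIFIERS></EXPERIMENT>'
     . '<RUN_SET><RUN accession="ERR109334" total_spots="2708543"/></RUN_SET>'
     . '</EXPERIMENT_PACKAGE>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// evaluate() can return a scalar, here the string value of one element.
$primaryId = $xpath->evaluate('string(/EXPERIMENT_PACKAGE/EXPERIMENT/IDENTIFIERS/PRIMARY_ID)');

// query() returns a node list; collect one attribute per run.
$runs = array();
foreach ($xpath->query('/EXPERIMENT_PACKAGE/RUN_SET/RUN') as $run) {
    $runs[$run->getAttribute('accession')] = (int) $run->getAttribute('total_spots');
}

echo $primaryId, "\n";
print_r($runs);
```

In the full script you would run these queries against `$new` directly instead of reloading a string.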
So don't reinvent the wheel, just learn about the existing tools. It's sometimes easier than it looks at first sight.