I would like to convert NCBI's Biosample Metadata XML file to CSV, or RDF/XML as a second choice. To do that, I believe I have to learn more about the structure of this file. I can run basic XQueries in BaseX*, like just listing all <Id> values, but then I've been using shell tools like sort|uniq -c to count them. I have heard about XSLT transformations and GRDDL in passing, but I don't think a style sheet is provided for this XML document, and I don't know how to create or discover one.
For example, can I get a count of the number of <Id> elements for each <BioSample>? Are there any <BioSample> elements with more than one primary <Id>? What are the most common db attributes of the primary Ids?
Here's a query that shows my maximum level of XQuery sophistication at this point:
let $sep := '|'
for $bs in doc('biosample_set')/BioSampleSet/BioSample
(: multiple Id elements, potentially with db, is_primary and db_label attributes :)
let $id := $bs/Ids/Id[@is_primary="1"]
(: description also has Comment/Paragraph elements :)
let $dt := $bs/Description/Title
let $ti := $bs/Description/Organism/@taxonomy_id
let $mm := $bs/Models/Model
return string-join(
(
data($id),
data($dt),
data($mm),
data($ti)
),
$sep)
In summary, I would be grateful for XQuery snippets or other suggestions that help me with
Or for a single upvotable task: count the number of occurrences of each db attribute for the <Id> elements and serialize as CSV.
I have seen some US and European efforts to turn the Biosample Metadata document into RDF, but they do not appear to be up to date, maintained, or well funded (even though they come from well-regarded teams).
*I have also used Exist-DB and Python's xml.etree.ElementTree and related lxml methods, but am having trouble either loading or processing this 46 GB (unpacked) file.
<?xml version="1.0" encoding="UTF-8"?>
<BioSampleSet>
<BioSample submission_date="2008-04-04T08:44:24.950" last_update="2019-06-20T16:11:22.271" publication_date="2008-04-04T00:00:00.000" access="public" id="2" accession="SAMN00000002">
<Ids>
<Id db="BioSample" is_primary="1">SAMN00000002</Id>
<Id db="WUGSC" db_label="Sample name">19655</Id>
<Id db="SRA">SRS000002</Id>
</Ids>
<Description>
<Title>Alistipes putredinis DSM 17216</Title>
<Organism taxonomy_id="445970" taxonomy_name="Alistipes putredinis DSM 17216"/>
<Comment>
<Paragraph>Alistipes putredinis (GenBank Accession Number for 16S rDNA gene: L16497) is a member of the Bacteroidetes division of the domain bacteria and has been isolated from human feces. It has been found in 16S rDNA sequence-based enumerations of the colonic microbiota of adult humans (Eckburg et. al. (2005), Ley et. al. (2006)). </Paragraph>
<Paragraph>Keywords: GSC:MIxS;MIGS:5.0</Paragraph>
</Comment>
</Description>
<Owner>
<Name abbreviation="WUGSC">Washington University, Genome Sequencing Center</Name>
<Contacts>
<Contact email="[email protected]"/>
</Contacts>
</Owner>
<Models>
<Model>MIGS.ba</Model>
</Models>
<Package display_name="MIGS: cultured bacteria/archaea; version 5.0">MIGS.ba.5.0</Package>
<Attributes>
<Attribute attribute_name="finishing strategy (depth of coverage)">Level 3: Improved-High-Quality Draft11.6x;20</Attribute>
<Attribute attribute_name="collection date" harmonized_name="collection_date" display_name="collection date">not determined</Attribute>
<Attribute attribute_name="estimated_size" harmonized_name="estimated_size" display_name="estimated size">2550000</Attribute>
<Attribute attribute_name="sop">http://hmpdacc.org/doc/CommonGeneAnnotation_SOP.pdf</Attribute>
<Attribute attribute_name="project_type">Reference Genome</Attribute>
<Attribute attribute_name="host" harmonized_name="host" display_name="host">Homo sapiens</Attribute>
<Attribute attribute_name="lat_lon" harmonized_name="lat_lon" display_name="latitude and longitude">not determined</Attribute>
<Attribute attribute_name="biome" harmonized_name="env_broad_scale" display_name="broad-scale environmental context">terrestrial biome [ENVO:00000446]</Attribute>
<Attribute attribute_name="misc_param: HMP body site">not determined</Attribute>
<Attribute attribute_name="nucleic acid extraction">not determined</Attribute>
<Attribute attribute_name="feature" harmonized_name="env_local_scale" display_name="local-scale environmental context">human-associated habitat [ENVO:00009003]</Attribute>
<Attribute attribute_name="investigation_type" harmonized_name="investigation_type" display_name="investigation type">missing</Attribute>
<Attribute attribute_name="host taxid" harmonized_name="host_taxid" display_name="host taxonomy ID">9606</Attribute>
<Attribute attribute_name="project_name" harmonized_name="project_name" display_name="project name">Alistipes putredinis DSM 17216</Attribute>
<Attribute attribute_name="assembly">PCAP</Attribute>
<Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">not determined</Attribute>
<Attribute attribute_name="source_mat_id" harmonized_name="source_material_id" display_name="source material identifiers">DSM 17216, CCUG 45780, CIP 104286, ATCC 29800, Carlier 10203, VPI 3293</Attribute>
<Attribute attribute_name="material" harmonized_name="env_medium" display_name="environmental medium">biological product [ENVO:02000043]</Attribute>
<Attribute attribute_name="ref_biomaterial" harmonized_name="ref_biomaterial" display_name="reference for biomaterial">not determined</Attribute>
<Attribute attribute_name="misc_param: HMP supersite">gastrointestinal_tract</Attribute>
<Attribute attribute_name="num_replicons" harmonized_name="num_replicons" display_name="number of replicons">not determined</Attribute>
<Attribute attribute_name="sequencing method">454-GS20, Sanger</Attribute>
<Attribute attribute_name="isol_growth_condt" harmonized_name="isol_growth_condt" display_name="isolation and growth condition">not determined</Attribute>
<Attribute attribute_name="env_package" harmonized_name="env_package" display_name="environmental package">missing</Attribute>
<Attribute attribute_name="strain" harmonized_name="strain" display_name="strain">DSM 17216</Attribute>
<Attribute attribute_name="isolation-source" harmonized_name="isolation_source" display_name="isolation source">missing</Attribute>
<Attribute attribute_name="type-material">type strain of Alistipes putredinis</Attribute>
</Attributes>
<Links>
<Link type="url" label="DNA Source">http://www.dsmz.de/catalogues/details/culture/DSM-17216</Link>
<Link type="entrez" target="bioproject">19655</Link>
</Links>
<Status status="live" when="2013-08-05T10:18:49"/>
</BioSample>
Similar to my answer for https://www.biostars.org/p/280581/, using my tool xsltstream:
$ wget -q -O - "http://ftp.ncbi.nlm.nih.gov//biosample/biosample_set.xml.gz" | gunzip -c | java -jar dist/xsltstream.jar -n BioSample -t ~/jeter.xsl | head
SAMN00000002 SRS000002 Alistipes putredinis DSM 17216 445970 MIGS.ba
SAMN00000003 SRS000003 Anaerotruncus colihominis DSM 17241 445972 MIGS.ba
SAMN00000004 SRS000004 MIGS Cultured Bacterial/Archaeal sample from Bacteroides stercoris ATCC 43183 449673 MIGS.ba
SAMN00000005 SRS000005 Generic sample from Biomphalaria glabrata 6526 Generic
SAMN00000006 SRS000006 Generic sample from Callithrix jacchus 9483 Generic
SAMN00000007 SRS000007 Clostridium ramosum DSM 1402 445974 MIGS.ba
SAMN00000008 SRS000008 MIGS Cultured Bacterial/Archaeal sample from Dorea formicigenerans ATCC 27755 411461 MIGS.ba
SAMN00000009 SRS000009 Generic sample from Monodelphis domestica 13616 Generic
SAMN00000010 SRS000010 Generic sample from Ruminococcus sp. GM2/1 451639 Generic
SAMN00000011 SRS000011 Generic sample from Roseburia faecis M72/1 451638 Generic
with "jeter.xsl"
<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="BioSample">
  <!-- one line per BioSample; the separator is written as &#9; (tab) and the
       line end as &#10; so the whitespace survives copy/paste -->
  <xsl:value-of select="Ids/Id[@db='BioSample']/text()"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="Ids/Id[@db='UGAML']/text()"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="Ids/Id[@db='SRA']/text()"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="Description/Title/text()"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="Description/Organism/@taxonomy_id"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="Models/Model/text()"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:stylesheet>
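For the single task in the question (count each db attribute of the <Id> elements and write CSV), the same one-record-at-a-time idea can also be sketched in Python with the standard library's xml.etree.ElementTree.iterparse. This is only a sketch under assumptions: the function names count_id_dbs and write_csv are made up for illustration, and it assumes the layout of the sample record above (BioSample directly under BioSampleSet, identifiers under Ids/Id). It also tallies the db values of primary Ids and notes any sample with more than one primary Id.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET
from collections import Counter


def count_id_dbs(source):
    """Stream BioSample records from a path or binary file object.

    Returns (db_counts, primary_db_counts, multi_primary); the last item
    lists accessions of samples with more than one primary Id.
    """
    db_counts = Counter()
    primary_db_counts = Counter()
    multi_primary = []
    # "end" events fire once an element is complete; the first "start"
    # event hands us the root so finished samples can be discarded.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event != "end" or elem.tag != "BioSample":
            continue
        n_primary = 0
        for id_elem in elem.iterfind("Ids/Id"):
            db = id_elem.get("db", "")  # empty string if db is absent
            db_counts[db] += 1
            if id_elem.get("is_primary") == "1":
                primary_db_counts[db] += 1
                n_primary += 1
        if n_primary > 1:
            multi_primary.append(elem.get("accession"))
        root.clear()  # drop processed samples to keep memory bounded
    return db_counts, primary_db_counts, multi_primary


def write_csv(counts, out):
    """Write db,count rows, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["db", "count"])
    for db, n in counts.most_common():
        writer.writerow([db, n])


# Tiny in-memory demo, cut down from the record shown in the question.
demo = io.BytesIO(
    b'<BioSampleSet><BioSample accession="SAMN00000002"><Ids>'
    b'<Id db="BioSample" is_primary="1">SAMN00000002</Id>'
    b'<Id db="WUGSC" db_label="Sample name">19655</Id>'
    b'<Id db="SRA">SRS000002</Id>'
    b"</Ids></BioSample></BioSampleSet>"
)
all_dbs, primary_dbs, multi = count_id_dbs(demo)
write_csv(all_dbs, sys.stdout)
```

For the real file, passing gzip.open("biosample_set.xml.gz", "rb") as source should avoid unpacking the 46 GB first, since iterparse accepts any binary file object. This is untested on the full dump, so treat the memory behaviour as an expectation rather than a measurement.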