
Learn structure of XML file in preparation for CSV or RDF conversion


I would like to convert NCBI's Biosample Metadata XML file to CSV, or RDF/XML as a second choice. To do that, I believe I have to learn more about the structure of this file. I can run basic XQueries in BaseX*, like just listing all <Id> values, but then I've been using shell tools like sort|uniq -c to count them. I have heard about XSLT transformations and GRDDL in passing, but I don't think a style sheet is provided for this XML document, and I don't know how to create or discover one.

For example, can I get a count of the number of <Id>s for each <BioSample>? Are there any <BioSample> elements with more than one primary <Id>? What are the most common db attribute values among the primary <Id>s?

Here's a query that shows my maximum level of XQuery sophistication at this point:

let $sep := '|'
for $bs in doc('biosample_set')/BioSampleSet/BioSample
(: multiple Id elements, potentially with db, is_primary and db_label attributes :)
let $id := $bs/Ids/Id[@is_primary="1"]
(: description also has Comment/Paragraph elements :)
let $dt := $bs/Description/Title
let $ti := $bs/Description/Organism/@taxonomy_id
let $mm := $bs/Models/Model
  
return string-join(
       (
         data($id),
         data($dt),
         data($mm),
         data($ti)
       ),
       $sep)

In summary, I would be grateful for XQuery snippets or other suggestions that help me with

  • structure discovery
  • aggregating counts
  • best practices for CSV serialization

Or for a single upvotable task: count the number of occurrences of each db attribute for the <Id> elements and serialize as CSV.

I have seen some US and European efforts to turn the BioSample metadata document into RDF, but they do not appear to be up to date, maintained, or well funded (even though they come from well-regarded teams).

*I have also used eXist-db and Python's xml.etree.ElementTree and related lxml methods, but I am having trouble either loading or processing this 46 GB (unpacked) file.

<?xml version="1.0" encoding="UTF-8"?>
<BioSampleSet>
<BioSample submission_date="2008-04-04T08:44:24.950" last_update="2019-06-20T16:11:22.271" publication_date="2008-04-04T00:00:00.000" access="public" id="2" accession="SAMN00000002">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN00000002</Id>
    <Id db="WUGSC" db_label="Sample name">19655</Id>
    <Id db="SRA">SRS000002</Id>
  </Ids>
  <Description>
    <Title>Alistipes putredinis DSM 17216</Title>
    <Organism taxonomy_id="445970" taxonomy_name="Alistipes putredinis DSM 17216"/>
    <Comment>
      <Paragraph>Alistipes putredinis (GenBank Accession Number for 16S rDNA gene: L16497) is a member of the Bacteroidetes division of the domain bacteria and has been isolated from human feces. It has been found in 16S rDNA sequence-based enumerations of the colonic microbiota of adult humans (Eckburg et. al. (2005), Ley et. al. (2006)). </Paragraph>
      <Paragraph>Keywords: GSC:MIxS;MIGS:5.0</Paragraph>
    </Comment>
  </Description>
  <Owner>
    <Name abbreviation="WUGSC">Washington University, Genome Sequencing Center</Name>
    <Contacts>
      <Contact email="[email protected]"/>
    </Contacts>
  </Owner>
  <Models>
    <Model>MIGS.ba</Model>
  </Models>
  <Package display_name="MIGS: cultured bacteria/archaea; version 5.0">MIGS.ba.5.0</Package>
  <Attributes>
    <Attribute attribute_name="finishing strategy (depth of coverage)">Level 3: Improved-High-Quality Draft11.6x;20</Attribute>
    <Attribute attribute_name="collection date" harmonized_name="collection_date" display_name="collection date">not determined</Attribute>
    <Attribute attribute_name="estimated_size" harmonized_name="estimated_size" display_name="estimated size">2550000</Attribute>
    <Attribute attribute_name="sop">http://hmpdacc.org/doc/CommonGeneAnnotation_SOP.pdf</Attribute>
    <Attribute attribute_name="project_type">Reference Genome</Attribute>
    <Attribute attribute_name="host" harmonized_name="host" display_name="host">Homo sapiens</Attribute>
    <Attribute attribute_name="lat_lon" harmonized_name="lat_lon" display_name="latitude and longitude">not determined</Attribute>
    <Attribute attribute_name="biome" harmonized_name="env_broad_scale" display_name="broad-scale environmental context">terrestrial biome [ENVO:00000446]</Attribute>
    <Attribute attribute_name="misc_param: HMP body site">not determined</Attribute>
    <Attribute attribute_name="nucleic acid extraction">not determined</Attribute>
    <Attribute attribute_name="feature" harmonized_name="env_local_scale" display_name="local-scale environmental context">human-associated habitat [ENVO:00009003]</Attribute>
    <Attribute attribute_name="investigation_type" harmonized_name="investigation_type" display_name="investigation type">missing</Attribute>
    <Attribute attribute_name="host taxid" harmonized_name="host_taxid" display_name="host taxonomy ID">9606</Attribute>
    <Attribute attribute_name="project_name" harmonized_name="project_name" display_name="project name">Alistipes putredinis DSM 17216</Attribute>
    <Attribute attribute_name="assembly">PCAP</Attribute>
    <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">not determined</Attribute>
    <Attribute attribute_name="source_mat_id" harmonized_name="source_material_id" display_name="source material identifiers">DSM 17216, CCUG 45780, CIP 104286, ATCC 29800, Carlier 10203, VPI 3293</Attribute>
    <Attribute attribute_name="material" harmonized_name="env_medium" display_name="environmental medium">biological product [ENVO:02000043]</Attribute>
    <Attribute attribute_name="ref_biomaterial" harmonized_name="ref_biomaterial" display_name="reference for biomaterial">not determined</Attribute>
    <Attribute attribute_name="misc_param: HMP supersite">gastrointestinal_tract</Attribute>
    <Attribute attribute_name="num_replicons" harmonized_name="num_replicons" display_name="number of replicons">not determined</Attribute>
    <Attribute attribute_name="sequencing method">454-GS20, Sanger</Attribute>
    <Attribute attribute_name="isol_growth_condt" harmonized_name="isol_growth_condt" display_name="isolation and growth condition">not determined</Attribute>
    <Attribute attribute_name="env_package" harmonized_name="env_package" display_name="environmental package">missing</Attribute>
    <Attribute attribute_name="strain" harmonized_name="strain" display_name="strain">DSM 17216</Attribute>
    <Attribute attribute_name="isolation-source" harmonized_name="isolation_source" display_name="isolation source">missing</Attribute>
    <Attribute attribute_name="type-material">type strain of Alistipes putredinis</Attribute>
  </Attributes>
  <Links>
    <Link type="url" label="DNA Source">http://www.dsmz.de/catalogues/details/culture/DSM-17216</Link>
    <Link type="entrez" target="bioproject">19655</Link>
  </Links>
  <Status status="live" when="2013-08-05T10:18:49"/>
</BioSample>

Solution

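  • This is similar to my answer at https://www.biostars.org/p/280581/, using my tool xsltstream, which applies a stylesheet to each <BioSample> element in turn rather than to the whole 46 GB document at once: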

    $ wget -q -O - "http://ftp.ncbi.nlm.nih.gov//biosample/biosample_set.xml.gz" | gunzip -c | java -jar dist/xsltstream.jar -n BioSample -t ~/jeter.xsl |  head
    SAMN00000002        SRS000002   Alistipes putredinis DSM 17216  445970  MIGS.ba 
    SAMN00000003        SRS000003   Anaerotruncus colihominis DSM 17241 445972  MIGS.ba 
    SAMN00000004        SRS000004   MIGS Cultured Bacterial/Archaeal sample from Bacteroides stercoris ATCC 43183   449673  MIGS.ba 
    SAMN00000005        SRS000005   Generic sample from Biomphalaria glabrata   6526    Generic 
    SAMN00000006        SRS000006   Generic sample from Callithrix jacchus  9483    Generic 
    SAMN00000007        SRS000007   Clostridium ramosum DSM 1402    445974  MIGS.ba 
    SAMN00000008        SRS000008   MIGS Cultured Bacterial/Archaeal sample from Dorea formicigenerans ATCC 27755   411461  MIGS.ba 
    SAMN00000009        SRS000009   Generic sample from Monodelphis domestica   13616   Generic 
    SAMN00000010        SRS000010   Generic sample from Ruminococcus sp. GM2/1  451639  Generic 
    SAMN00000011        SRS000011   Generic sample from Roseburia faecis M72/1  451638  Generic 
    

    with "jeter.xsl"

    <?xml version='1.0' encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
    <xsl:output method="text"  encoding="UTF-8"/>
    <xsl:template match="BioSample">
    <!-- One output row per BioSample; &#9; is a tab, &#10; a newline. -->
    <xsl:value-of select="Ids/Id[@db='BioSample']/text()"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="Ids/Id[@db='UGAML']/text()"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="Ids/Id[@db='SRA']/text()"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="Description/Title/text()"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="Description/Organism/@taxonomy_id"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="Models/Model/text()"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:text>&#10;</xsl:text>
    </xsl:template>
    
    </xsl:stylesheet>
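
The question also asks for a count of each db attribute on the <Id> elements, serialized as CSV, and mentions trouble loading the file with Python's xml.etree.ElementTree. The same streaming idea can be sketched in Python with iterparse, which hands over one finished <BioSample> at a time. This is only a minimal sketch: the function names count_id_dbs and write_csv and the two-record SAMPLE are illustrative choices made here, not part of any NCBI tooling, and the element paths are taken from the excerpt in the question.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET
from collections import Counter

# Toy input modelled on the excerpt in the question (illustrative only).
SAMPLE = b"""<BioSampleSet>
<BioSample accession="SAMN00000002"><Ids>
  <Id db="BioSample" is_primary="1">SAMN00000002</Id>
  <Id db="WUGSC" db_label="Sample name">19655</Id>
  <Id db="SRA">SRS000002</Id>
</Ids></BioSample>
<BioSample accession="SAMN00000003"><Ids>
  <Id db="BioSample" is_primary="1">SAMN00000003</Id>
  <Id db="SRA">SRS000003</Id>
</Ids></BioSample>
</BioSampleSet>"""


def count_id_dbs(source):
    """Stream BioSample records from a binary file object.

    Returns (Counter of Id/@db values, number of BioSamples that
    carry more than one Id with is_primary="1").
    """
    db_counts = Counter()
    multi_primary = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of <BioSampleSet>
    for event, elem in context:
        # Only act once a whole <BioSample> has been read.
        if event != "end" or elem.tag != "BioSample":
            continue
        primary = 0
        for id_elem in elem.iterfind("Ids/Id"):
            db_counts[id_elem.get("db", "")] += 1
            if id_elem.get("is_primary") == "1":
                primary += 1
        if primary > 1:
            multi_primary += 1
        root.clear()  # discard processed records so memory stays flat
    return db_counts, multi_primary


def write_csv(db_counts, out):
    """Write db,count rows, most frequent first (ties sorted by name)."""
    writer = csv.writer(out)  # the csv module handles quoting and escaping
    writer.writerow(["db", "count"])
    for db, n in sorted(db_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([db, n])


counts, multi = count_id_dbs(io.BytesIO(SAMPLE))
write_csv(counts, sys.stdout)
print(f"BioSamples with more than one primary Id: {multi}", file=sys.stderr)
```

For the real download, passing gzip.open("biosample_set.xml.gz", "rb") to count_id_dbs should work the same way, since iterparse accepts any binary file object. Runtime on the full dump is untested here. Letting the csv module do the quoting avoids the problem with hand-joined separators when a title contains the delimiter.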