r, xml-parsing, xml2

Trouble extracting a node from XML: xml_find_all not working as expected


My question might be fairly simple, but I'm having trouble working with XML. I have a list of metabolites and a database where I can find information about them in XML format. I'm trying to create a table of synonyms so I can translate the metabolite names I have into ones better suited to the downstream analysis. Here is a simple piece of code where I try to access the synonyms node, and for some reason it returns nothing. The same approach worked on another XML file. Also, any tips on how to build this table would be appreciated.

library(xml2)

metabolites <- read_xml('<?xml version="1.0" encoding="UTF-8"?>
                    <hmdb xmlns="http://www.hmdb.ca">
                    <metabolite>
                    <version>4.0</version>
                    <creation_date>2005-11-16 15:48:42 UTC</creation_date>
                    <update_date>2019-01-11 19:13:56 UTC</update_date>
                    <accession>HMDB0000001</accession>
                    <status>quantified</status>
                    <secondary_accessions>
                    <accession>HMDB00001</accession>
                    <accession>HMDB0004935</accession>
                    </secondary_accessions>
                    <name>1-Methylhistidine</name>
                    <cs_description>1-Methylhistidine, also known as 1-mhis...</cs_description>
                    <description>One-methylhistidine (1-MHis) is derived ...</description>
                    <synonyms>
                    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
                    <synonym>1-Methylhistidine</synonym>
                    <synonym>Pi-methylhistidine</synonym>
                    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
                    <synonym>1 Methylhistidine</synonym>
                    </synonyms>
                    <chemical_formula>C7H11N3O2</chemical_formula>
                    <average_molecular_weight>169.1811</average_molecular_weight>
                    </metabolite>
                    </hmdb>')


syn <- xml_find_all(metabolites, "//synonyms")

Thanks!


Solution

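  • It has to do with the namespace declaration. The root element declares a default namespace (xmlns="http://www.hmdb.ca"), so every element in the document belongs to it, and an unprefixed XPath name such as //synonyms only matches elements that are in no namespace. xml2 exposes the default namespace under the prefix d1, which you can use in the query. See the discussion here: https://github.com/r-lib/xml2/issues/222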

    library(xml2)
    
    metabolites <- read_xml('<hmdb xmlns="http://www.hmdb.ca">
    <metabolite>
    <version>4.0</version>
    <creation_date>2005-11-16 15:48:42 UTC</creation_date>
    <update_date>2019-01-11 19:13:56 UTC</update_date>
    <accession>HMDB0000001</accession>
    <status>quantified</status>
    <secondary_accessions>
    <accession>HMDB00001</accession>
    <accession>HMDB0004935</accession>
    </secondary_accessions>
    <name>1-Methylhistidine</name>
    <cs_description>1-Methylhistidine, also known as 1-mhis...</cs_description>
    <description>One-methylhistidine (1-MHis) is derived ...</description>
    <synonyms>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
    <synonym>1 Methylhistidine</synonym>
    </synonyms>
    <chemical_formula>C7H11N3O2</chemical_formula>
    <average_molecular_weight>169.1811</average_molecular_weight>
    </metabolite>
    </hmdb>')
    
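    # the default namespace is registered under the prefix d1
    xml_ns(metabolites)
    #> d1 <-> http://www.hmdb.ca
    # doesn't work: an unprefixed name only matches elements in no namespace
    xml_find_all(metabolites, "//synonyms")
    #> {xml_nodeset (0)}
    # works: the d1 prefix selects elements in the default namespace
    xml_find_all(metabolites, "//d1:synonyms")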
    #> {xml_nodeset (1)}
    #> [1] <synonyms>\n  <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)pro ...
    

    Created on 2019-11-09 by the reprex package (v0.3.0)
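
As for building the synonym table, here is a minimal sketch rather than a definitive implementation. It assumes every `<metabolite>` has one `<accession>` and one `<name>` as direct children, uses a trimmed copy of the document from the question, and introduces a hypothetical helper `synonyms_of()` that is not part of xml2. The namespace map is passed explicitly so the `d1` prefix resolves the same way on every node.

```r
library(xml2)

# Trimmed copy of the document from the question (same default namespace).
metabolites <- read_xml('<hmdb xmlns="http://www.hmdb.ca">
  <metabolite>
    <accession>HMDB0000001</accession>
    <name>1-Methylhistidine</name>
    <synonyms>
      <synonym>1-Methylhistidine</synonym>
      <synonym>Pi-methylhistidine</synonym>
      <synonym>1 Methylhistidine</synonym>
    </synonyms>
  </metabolite>
</hmdb>')

# Namespace map of the document; the default namespace appears as d1.
ns <- xml_ns(metabolites)

# Hypothetical helper (not an xml2 function): one row per synonym of a
# single <metabolite> node. "./" keeps the lookup to direct children, so
# accessions nested under <secondary_accessions> are not picked up.
synonyms_of <- function(node) {
  syn <- xml_text(xml_find_all(node, "./d1:synonyms/d1:synonym", ns))
  data.frame(
    accession = rep(xml_text(xml_find_first(node, "./d1:accession", ns)), length(syn)),
    name      = rep(xml_text(xml_find_first(node, "./d1:name", ns)), length(syn)),
    synonym   = syn,
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every metabolite and stack the results.
nodes <- xml_find_all(metabolites, "//d1:metabolite", ns)
synonym_table <- do.call(rbind, lapply(nodes, synonyms_of))
print(synonym_table)
```

With the table in hand, a name can be translated with something like `synonym_table$name[match(my_names, synonym_table$synonym)]`, where `my_names` stands for your own vector of metabolite names. If the prefixes get tedious, xml2 also provides `xml_ns_strip()`, which removes the default namespace so that unprefixed paths such as `//synonyms` match.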