
Reading an XML file into an R table


I have data in a specialized XML format called HML. The file is very large, containing data on hundreds of samples, and has a structure that looks like this (first several lines of the file):

<?xml version="1.0" encoding="utf-8"?>
<hml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schemas.nmdp.org/spec/hml/1.0.1  http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd" version="1.0.1" project-name="NGS_low_res_TruSight_NGSengine" xmlns="http://schemas.nmdp.org/spec/hml/1.0.1">
  <hmlid root="2.16.840.1.113883.19.3.999.5" />
  <reporting-center reporting-center-id="UPABRO" />
  <sample id="19-45678-HLA-072319AB" center-code="UPABRO">
    <collection-method>unknown</collection-method>
    <typing gene-family="HLA" date="2022-08-03">
      <allele-assignment date="2022-06-13" allele-db="IMGT/HLA" allele-version="3.47.0">
        <glstring>HLA-A*02:01:01:01+HLA-A*29:02:01:01</glstring>

The data I want to extract are the sample id (line 5 of the excerpt) and the glstring (last line). Note that each sample contains several glstring elements.

I have used the xml2 package to read the file:

library(xml2)
test <- read_xml("test.hml")
xml_children(test)

This gives the following output:

{xml_nodeset (386)}
 [1] <hmlid root="2.16.840.1.113883.19.3.999.5"/>
 [2] <reporting-center reporting-center-id="UPABRO"/>
 [3] <sample id="19-45678-HLA-072319AB" center-code="UPABRO">\n  <collection-m ...
 [4] <sample id="19-45679-HLA-011620-AB-NGS" center-code="UPABRO">\n  <collect ...
 [5] <sample id="19-45680-HLA-080819-AB-NGS" center-code="UPABRO">\n  <collect ...

The individual samples that I'm interested in obviously start at child 3. However, I cannot find a way to start extracting data at child 3 and then extract only certain nodes from each child.


Solution

  • Since the glstring is the content of the node rather than an attribute, use the xml_text() function to retrieve its value.

    library(xml2)
    library(dplyr)   # provides %>% and bind_rows()
    
    # d1 is the prefix xml2 assigns to the document's default namespace
    bar2 <- test %>%
       xml_find_all("//d1:sample") %>%
       xml_find_all(".//d1:glstring") %>%
       xml_text()
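
    The d1: prefix in the XPath above is not written anywhere in the file. xml2 assigns d1 to the document's default namespace, and xml_ns() lists that mapping. Here is a minimal sketch on a made-up two-sample document (the sample IDs and glstring values are invented for illustration; only the namespace URI comes from the question). It shows the prefix at work and the difference between xml_attr() and xml_text():

    ```r
    library(xml2)

    # Hypothetical two-sample document in the same default namespace as the HML file
    doc <- read_xml('<hml xmlns="http://schemas.nmdp.org/spec/hml/1.0.1">
      <sample id="S1"><glstring>HLA-A*02:01+HLA-A*29:02</glstring></sample>
      <sample id="S2"><glstring>HLA-B*07:02+HLA-B*44:03</glstring></sample>
    </hml>')

    # Named vector mapping prefixes to namespace URIs; the default namespace is named d1
    ns <- xml_ns(doc)

    # Unprefixed names do not match nodes that live in a namespace
    n_unprefixed <- length(xml_find_all(doc, "//sample"))

    # With the d1 prefix, both samples are found
    samples <- xml_find_all(doc, "//d1:sample")

    ids <- xml_attr(samples, "id")                        # attribute values
    gls <- xml_text(xml_find_all(doc, "//d1:glstring"))   # node contents
    ```

    If the prefixes get in the way, xml2 also offers xml_ns_strip(), which removes the default namespace so that plain names such as "//sample" match. Note that it modifies the document in place.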
    

    Because you mention that there are multiple glstrings per sample ID, you will most likely need to loop over the samples and extract the glstrings for each one.

    This example code should work (I have not tested it, since the XML excerpt above does not contain a complete sample):

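    samples <- test %>%
       xml_find_all(".//d1:sample")
    
    dfs <- lapply(samples, function(node) {
       # get the sample ID from the id attribute
       sampleID <- node %>% xml_attr("id")
       # get the text of every glstring within this sample
       glstring <- node %>% xml_find_all(".//d1:glstring") %>% xml_text()
    
       # one row per glstring; sampleID is recycled to match
       data.frame(sampleID, glstring)
    })
    # stack the per-sample data frames into one table
    answer <- bind_rows(dfs)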
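
    Since the code above was not run against a complete file, here is a self-contained check of the same per-sample approach on a small made-up document (the IDs and glstrings are invented; only the namespace URI and the hmlid line come from the question). As a sketch, it differs from the code above in two ways. It uses base R's do.call(rbind, ...) in place of bind_rows(), so only xml2 is needed. It also uses rep(), so a sample without any glstring yields zero rows rather than an error from data.frame():

    ```r
    library(xml2)

    # Hypothetical miniature HML document: S1 has two glstrings, S2 has one
    doc <- read_xml('<hml xmlns="http://schemas.nmdp.org/spec/hml/1.0.1">
      <hmlid root="2.16.840.1.113883.19.3.999.5"/>
      <sample id="S1">
        <typing><allele-assignment>
          <glstring>HLA-A*02:01+HLA-A*29:02</glstring>
          <glstring>HLA-B*07:02+HLA-B*44:03</glstring>
        </allele-assignment></typing>
      </sample>
      <sample id="S2">
        <typing><allele-assignment>
          <glstring>HLA-A*01:01+HLA-A*03:01</glstring>
        </allele-assignment></typing>
      </sample>
    </hml>')

    # Selecting by name skips hmlid and any other non-sample children
    samples <- xml_find_all(doc, ".//d1:sample")

    dfs <- lapply(samples, function(node) {
      glstring <- xml_text(xml_find_all(node, ".//d1:glstring"))
      # rep() keeps both columns the same length, even for zero glstrings
      data.frame(sampleID = rep(xml_attr(node, "id"), length(glstring)),
                 glstring = glstring)
    })

    # Base R stand-in for dplyr::bind_rows(dfs)
    answer <- do.call(rbind, dfs)
    ```

    The result has one row per glstring, with the sample ID repeated on each row. Because the XPath selects sample nodes by name, there is no need to start at child 3 explicitly.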