xml2 processing unclear error while transforming into dataframe


I am trying to convert an XML file to a data frame. For some elements it works nicely, but for others it does not, and I am not sure why.

Here is a simple version of the XML:

<?xml version="1.0" encoding="UTF-8"?>
  <clinical_study rank="6838">
    <arm_group>
    <arm_group_label>Arm I (Lozenge placebo)</arm_group_label>
    <arm_group_type>Placebo Comparator</arm_group_type>
    <description>Patients receive lozenge placebo PO QID.</description>
    </arm_group>
    <arm_group>
    <arm_group_label>Arm II (LBR lozenge)</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
    <description>Patients receive lyophilized black raspberries lozenge PO (8gms/day)</description>
    </arm_group>
    <arm_group>
    <arm_group_label>Arm III (Saliva Substitute placebo)</arm_group_label>
    <arm_group_type>Placebo Comparator</arm_group_type>
    <description>Patients receive Saliva Substitute placebo PO QID.</description>
    </arm_group>
    <arm_group>
    <arm_group_label>Arm IV (LBR Saliva Substitute)</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
    <description>Patients receive lyophilized black raspberries Saliva Substitute PO (8gms/day).</description>
    </arm_group>
    <condition_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm            -->
    <mesh_term>Carcinoma</mesh_term>
    <mesh_term>Carcinoma, Squamous Cell</mesh_term>
    <mesh_term>Laryngeal Diseases</mesh_term>
    <mesh_term>Laryngeal Neoplasms</mesh_term>
    <mesh_term>Oropharyngeal Neoplasms</mesh_term>
    <mesh_term>Carcinoma, Verrucous</mesh_term>
    <mesh_term>Nasopharyngeal Neoplasms</mesh_term>
    <mesh_term>Salivary Gland Neoplasms</mesh_term>
    <mesh_term>Paranasal Sinus Neoplasms</mesh_term>
    <mesh_term>Head and Neck Neoplasms</mesh_term>
    <mesh_term>Neoplasms, Unknown Primary</mesh_term>
    <mesh_term>Mouth Neoplasms</mesh_term>
    <mesh_term>Hypopharyngeal Neoplasms</mesh_term>
    <mesh_term>Tongue Neoplasms</mesh_term>
    <mesh_term>Lip Neoplasms</mesh_term>
    <mesh_term>Carcinoma in Situ</mesh_term>
    </condition_browse>
    <!-- Results have not yet been posted for this study                                          -->
    </clinical_study>

Here is the code I am using that works as expected:

library(XML)
library(dplyr)
library(xml2)
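# `xml` is assumed to hold the document already parsed with xml2::read_xml()
# Read the arm_group nodes: one row per arm_group
outc <- xml_find_all(xml, "//arm_group") %>%
  as_list() %>%
  dplyr::bind_rows() %>%
  as.data.frame()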

And here is the code that does not work:

test1 <- xml_find_all(xml, "//condition_browse") %>% as_list() %>% dplyr::bind_rows() %>% as.data.frame()

This second piece of code produces a data frame with a single row, instead of the expected multi-row data frame with one row per mesh_term.

I cannot tell whether the problem comes from my xml2 syntax, from the XPath expression, or from the XML data itself.

Could you please help?


Solution

  • All of the nodes under condition_browse have the same name, "mesh_term". bind_rows() merges the identically named elements, so only the last one is kept, which is why the result has a single row.
    Try this instead:

    # flatten the node into a named character vector
    temp <- xml_find_all(xml, "//condition_browse") %>% as_list() %>% unlist()

    # convert into a data frame
    test1 <- data.frame(names = names(temp), value = temp)
    

    This gives a slightly different format, but it should be a good starting point for the rest of your analysis.
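
As an alternative to flattening the list, the mesh_term nodes can be selected directly with XPath and read with xml_text(). The following is a minimal sketch, not necessarily the best approach for the full file: the inline XML string is a hypothetical sample trimmed from the question, and the names doc, terms, and mesh_df are made up for illustration.

```r
library(xml2)

# Hypothetical inline sample, trimmed from the XML in the question
doc <- read_xml('<clinical_study>
  <condition_browse>
    <mesh_term>Carcinoma</mesh_term>
    <mesh_term>Carcinoma, Squamous Cell</mesh_term>
    <mesh_term>Laryngeal Diseases</mesh_term>
  </condition_browse>
</clinical_study>')

# Select each mesh_term node directly, so every term is its own element
terms <- xml_find_all(doc, "//condition_browse/mesh_term")

# xml_text() returns the text content of each node: one row per term
mesh_df <- data.frame(mesh_term = xml_text(terms), stringsAsFactors = FALSE)
```

Because the XPath targets the repeated child nodes rather than their parent, duplicate names never reach bind_rows(), and the data frame should contain one row per mesh_term element.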