
Convert an XML with ASCII entity names to plain XML in R


I have the following XML file:

<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
         <xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
         <xmp:CreatorTool>TeX</xmp:CreatorTool>
         <xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
         <xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
         <pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
         <pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
         <pdf:Trapped>Unknown</pdf:Trapped>
         <pdf:Keywords/>
         <dc:format>application/pdf</dc:format>
         <xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
         <pdfwe:dafra>&#xA;&#xA;&lt;?xml version="1.0"?&gt;&#xA;&#x9;&lt;dataframe name="expData" &#xA;&#x9;&#x9;xmlns="url"&#xA;&#x9;&#x9;xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"&#xA;&#x9;&#x9;xsi:schemaLocation="url"&gt;&#xA;&#x9;&#x9;&lt;column name="DATA" type="ratio"&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;14&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;18&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;21&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;35&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;44&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;50&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;3&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;5&lt;/value&gt;&#xA;&#x9;&#x9;&#x9;&lt;value&gt;7&lt;/value&gt;&#xA;&#x9;&#x9;&lt;/column&gt;&#xA;&#x9;&lt;/dataframe&gt;&#xA;&#xA;&#x9;&#x9;&#x9;</pdfx_1_:Dataframe>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

As you can see, the Dataframe tag in the pdfwe namespace contains another XML document, escaped with entities. I need to extract this inner XML and convert it to plain XML with no ASCII entity names, like the following:

<?xml version="1.0"?>
    <dataframe name="expData" 
        xmlns="url"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="url">
        <column name="DATA" type="ratio">
            <value>14</value>
            <value>18</value>
            <value>21</value>
            <value>35</value>
            <value>44</value>
            <value>50</value>
            <value>3</value>
            <value>5</value>
            <value>7</value>
        </column>
    </dataframe>

To extract the contents of pdfwe:dafra I'm using xml_find_all(x, ".//pdfwe:dafra") from the xml2 package, but I'm not getting the result I want.

To convert the entity names I'm using xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))), but that doesn't give me the result I want either.

Thanks in advance!


Solution

  • The solution is a multi-step process: extract the Dataframe node, convert it to text (which decodes the entities), trim the surrounding whitespace, and then parse that text back into XML with the read_xml() function.

    library(xml2)
    library(magrittr)  #for %>%; xml2 does not export the pipe
    
    page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
       .....')  #read in the entire file
    
    xml_ns(page)  #show namespaces
    
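    #extract the node that holds the embedded XML
    db <- xml_find_first(page, ".//pdfx_1_:Dataframe") 
    #convert to text (this decodes the entities) and strip leading and
    #trailing whitespace so the XML declaration is the first thing in the string
    dbtext <- xml_text(db) %>% trimws()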
    
    #read the text in and convert to xml
    xml_db <- read_xml(dbtext)
    xml_ns(xml_db)  #show namespaces
    
    #extract the requested information from database
    #shown here for demonstration purposes
    xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()
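
For anyone who wants to try the steps above without the original file, here is a minimal self-contained sketch. The `xmp` string is a shortened, hypothetical stand-in for the packet in the question (only three values, most metadata dropped), and the `DATA` column name and the temporary output file are my own choices for illustration, not something the question prescribes.

```r
library(xml2)

# Hypothetical, shortened stand-in for the XMP packet in the question.
# The inner XML is escaped exactly the same way (&lt;, &gt;, &#xA;).
xmp <- paste0(
  '<x:xmpmeta xmlns:x="adobe:ns:meta/">',
  '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
  '<rdf:Description rdf:about="" xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">',
  '<pdfx_1_:Dataframe>&#xA;&#xA;&lt;?xml version="1.0"?&gt;&#xA;',
  '&lt;dataframe name="expData" xmlns="url"&gt;&#xA;',
  '&lt;column name="DATA" type="ratio"&gt;',
  '&lt;value&gt;14&lt;/value&gt;',
  '&lt;value&gt;18&lt;/value&gt;',
  '&lt;value&gt;21&lt;/value&gt;',
  '&lt;/column&gt;&#xA;&lt;/dataframe&gt;&#xA;&#xA;',
  '</pdfx_1_:Dataframe>',
  '</rdf:Description></rdf:RDF></x:xmpmeta>'
)

page <- read_xml(xmp)

# Locate the wrapper node; the prefix comes from xml_ns(page).
node <- xml_find_first(page, ".//pdfx_1_:Dataframe")

# xml_text() decodes the entities; trimws() removes the newlines that
# would otherwise sit in front of the XML declaration.
inner_text <- trimws(xml_text(node))

# Parse the decoded text as its own XML document.
inner <- read_xml(inner_text)

# The inner document uses a default namespace, which xml2 exposes as "d1".
values <- xml_text(xml_find_all(inner, ".//d1:value"))

# Optional: turn the values into a data frame (column name is an assumption).
df <- data.frame(DATA = as.numeric(values))

# Optional: save the decoded XML as a plain file (temporary path used here).
out <- tempfile(fileext = ".xml")
write_xml(inner, out)
```

If you only need the decoded XML as a string rather than a parsed document, `inner_text` already holds it; the second `read_xml()` call is only needed for querying or re-serialising.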