I have the following XML file:
<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:pdfx_1_="ns.adobe.org/pdfx/1.3/">
<xmp:CreateDate>2021-05-30T11:17:35+02:00</xmp:CreateDate>
<xmp:CreatorTool>TeX</xmp:CreatorTool>
<xmp:ModifyDate>2021-05-30T12:12:25+02:00</xmp:ModifyDate>
<xmp:MetadataDate>2021-05-30T12:12:25+02:00</xmp:MetadataDate>
<pdfx:PTEX.Fullbanner>This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) kpathsea version 6.3.1</pdfx:PTEX.Fullbanner>
<pdf:Producer>pdfTeX-1.40.20</pdf:Producer>
<pdf:Trapped>Unknown</pdf:Trapped>
<pdf:Keywords/>
<dc:format>application/pdf</dc:format>
<xmpMM:DocumentID>uuid:38d0617c-0385-5941-a87d-cc4a1e54bd76</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:d056c61c-55c6-5f44-8c0e-fe6e911c2ed9</xmpMM:InstanceID>
<pdfwe:dafra>

<?xml version="1.0"?>
	<dataframe name="expData" 
		xmlns="url"
		xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
		xsi:schemaLocation="url">
		<column name="DATA" type="ratio">
			<value>14</value>
			<value>18</value>
			<value>21</value>
			<value>35</value>
			<value>44</value>
			<value>50</value>
			<value>3</value>
			<value>5</value>
			<value>7</value>
		</column>
	</dataframe>

			</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
As you can see, the tag Dataframe of the namespace pdfwe contains another XML document. I need to extract this XML and convert it to normal XML with no ASCII entity names, like the following:
<?xml version="1.0"?>
<dataframe name="expData"
xmlns="url"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="url">
<column name="DATA" type="ratio">
<value>14</value>
<value>18</value>
<value>21</value>
<value>35</value>
<value>44</value>
<value>50</value>
<value>3</value>
<value>5</value>
<value>7</value>
</column>
</dataframe>
To extract what's inside pdfwe:dafra I'm using the function xml_find_all(x, ".//pdfwe:dafra") of the xml2 package, but I'm not getting the result I want.
To convert the entity names I'm using the function xml2::xml_text(xml2::read_xml(paste0("<x>", md, "</x>"))), but I'm not getting the result I want either.
Thanks in advance!
The solution is a multi-step process: extract the Dataframe node, convert it to text, clean it up, and then parse the text back into XML with the read_xml() function.
library(xml2)
library(magrittr) # provides the %>% pipe used below
page <- read_xml('<?xpacket begin="???" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c006 80.159825, 2016/09/16-03:31:08">
.....') #read in the entire file
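xml_ns(page) # show the namespaces declared in the XMP packet
# extract the node that holds the embedded document
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")
# convert to text (this decodes the entities) and strip surrounding whitespace,
# since the XML declaration has to be the very first thing in the string
dbtext <- xml_text(db) %>% trimws()
# parse the text as a standalone XML document
xml_db <- read_xml(dbtext)
xml_ns(xml_db) # show namespaces; the default namespace is exposed as d1
# extract the requested values from the embedded document
# (shown here for demonstration purposes)
xml_db %>% xml_find_all(".//d1:column") %>% xml_find_all(".//d1:value") %>% xml_text()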
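For anyone who wants to try the steps without the original file, here is a minimal self-contained sketch. The packet below is a cut-down, hypothetical stand-in for the real one: the namespace URIs (http://example.com/...) are placeholders I made up so the parser does not complain about relative URIs, and I am assuming the embedded document is stored entity-escaped (&lt; and &gt;), as the question describes. It avoids the pipe so that only xml2 is needed.

```r
library(xml2)

# Hypothetical miniature of the XMP packet. The embedded document is
# entity-escaped; the namespace URIs are placeholders, not the real ones.
xmp <- '<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdfx_1_="http://example.com/pdfx/">
<pdfx_1_:Dataframe>
&lt;?xml version="1.0"?&gt;
&lt;dataframe name="expData" xmlns="http://example.com/df"&gt;
&lt;column name="DATA" type="ratio"&gt;
&lt;value&gt;14&lt;/value&gt;
&lt;value&gt;18&lt;/value&gt;
&lt;/column&gt;
&lt;/dataframe&gt;
</pdfx_1_:Dataframe>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>'

page <- read_xml(xmp)

# Locate the wrapper node by its prefixed name
db <- xml_find_first(page, ".//pdfx_1_:Dataframe")

# xml_text() returns the node content with entities decoded;
# trimws() removes the newline in front of the XML declaration
dbtext <- trimws(xml_text(db))

# Parse the decoded text as its own XML document
xml_db <- read_xml(dbtext)

# The default namespace of the inner document gets the prefix d1
values <- xml_text(xml_find_all(xml_db, ".//d1:column/d1:value"))
print(values)

# Serialise the inner document back to plain (unescaped) XML text
cat(as.character(xml_db))
```

If the plain XML needs to end up in a file rather than on the console, write_xml(xml_db, "some_file.xml") should do it (the file name is just an example).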