I'm parsing XML docs from PubMed Central and sometimes I find paragraphs with nested tables like the example below. Is there a way in R to get the text and exclude the table?
library(XML)
doc <- xmlParse("<sec><p>Text</p>
<p><em>More</em> text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")
xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text" "More textSKIPthis"
I'd like to return paragraphs without the nested table rows.
[1] "Text" "More text"
You can remove the nodes you don't want. In this example I remove the nodes matched by the XPath //sec/p/table. Note that removeNodes modifies doc in place.
library(XML)
doc <- xmlParse("<sec><p>Text</p>
<p>More text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")
xpathSApply(doc, "//sec/p/table", removeNodes)
xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text" "More text"
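If you like the removeNodes approach but still need the original document afterwards, one option is to run the removal on a copy. This is only a sketch: it assumes xmlClone() from the XML package is acceptable for your document sizes, and the names work, cleaned and original are made up for illustration.

```r
library(XML)

doc <- xmlParse("<sec><p>Text</p>
<p><em>More</em> text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")

# Work on a copy so the removal does not touch the original document
work <- xmlClone(doc)

# Drop every table nested directly inside a paragraph (copy only)
invisible(xpathSApply(work, "//sec/p/table", removeNodes))

# Paragraph text from the copy, without the table contents
cleaned <- xpathSApply(work, "//sec/p", xmlValue)

# The original still contains the table text
original <- xpathSApply(doc, "//sec/p", xmlValue)
```

Because xmlValue is applied to each whole paragraph here, inline markup such as em stays part of the paragraph text.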
If you want to keep your doc intact, you could also consider:
library(XML)
doc <- xmlParse("<sec><p>Text</p>
<p>More text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")
xpathSApply(doc, "//sec/p/node()[not(self::table)]", xmlValue)
[1] "Text" "More text"
or simply:
xpathSApply(doc, "//sec/p/text()", xmlValue)
[1] "Text" "More text"
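Which is best will depend on the complexity of your real-world case. Both expressions return one value per matched child node rather than one per paragraph, and text() only matches text sitting directly inside p. With inline markup such as the <em>More</em> text paragraph from the question, text() drops "More" and node()[not(self::table)] splits the paragraph into separate elements.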
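For paragraphs with inline markup, a possible middle ground is to collect, per paragraph, every descendant text node that is not inside a table and paste the pieces back together. This is a rough sketch rather than a definitive solution: paragraph_text is a hypothetical helper name, and it assumes the only content you want to skip lives under table elements.

```r
library(XML)

doc <- xmlParse("<sec><p>Text</p>
<p><em>More</em> text<table>
<tr><td>SKIP</td><td>this</td></tr>
</table></p>
</sec>")

# Hypothetical helper: text of one paragraph, ignoring anything inside a table
paragraph_text <- function(p) {
  # All descendant text nodes of this paragraph that have no table ancestor
  parts <- xpathSApply(p, ".//text()[not(ancestor::table)]", xmlValue)
  # Join the pieces so each paragraph yields exactly one string
  paste(unlist(parts), collapse = "")
}

# One string per paragraph; doc itself is left unchanged
result <- sapply(getNodeSet(doc, "//sec/p"), paragraph_text)
result
# [1] "Text"      "More text"
```

This keeps the text from em and other inline elements, returns one element per paragraph, and does not modify doc.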