Get XML paragraphs without nested tables


I'm parsing XML docs from PubMed Central and sometimes I find paragraphs with nested tables like the example below. Is there a way in R to get the text and exclude the table?

library(XML)
doc <- xmlParse("<sec><p>Text</p>
  <p><em>More</em> text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

xpathSApply(doc, "//sec/p", xmlValue)
[1] "Text"              "More textSKIPthis"

I'd like to return paragraphs without the nested table rows.

[1] "Text"      "More text"

Solution

  • You can remove the nodes you don't want. In this example I remove the nodes matched by the XPath expression //sec/p/table:

    library(XML)
    doc <- xmlParse("<sec><p>Text</p>
      <p>More text<table>
       <tr><td>SKIP</td><td>this</td></tr>
      </table></p>
     </sec>")

    # removeNodes() modifies doc in place; invisible() hides its return value
    invisible(xpathSApply(doc, "//sec/p/table", removeNodes))
    xpathSApply(doc, "//sec/p", xmlValue)
    [1] "Text"      "More text"

    If you want to keep your doc intact you could also consider:

    library(XML)
    doc <- xmlParse("<sec><p>Text</p>
      <p>More text<table>
       <tr><td>SKIP</td><td>this</td></tr>
      </table></p>
     </sec>")

    # select every child node of <p> except <table>; doc is left untouched
    xpathSApply(doc, "//sec/p/node()[not(self::table)]", xmlValue)
    [1] "Text"      "More text"

    or simply:

    xpathSApply(doc, "//sec/p/text()", xmlValue)
    [1] "Text"      "More text"
    

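    Which is best depends on the complexity of your real-world case. Note that the last two expressions return one string per matching child node, not one per paragraph. With the question's original input, where the second paragraph contains <em>More</em>, node() returns "More" and " text" as separate elements, and text() drops "More" altogether because it only selects text nodes that are direct children of <p>.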
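
If the paragraphs contain inline markup such as <em>, as in the question, one possible approach is to build one string per paragraph from all of its descendant text nodes that are not inside a table. This is only a sketch: paragraph_text is a hypothetical helper name, and it assumes the XML package and the //sec/p structure shown above.

```r
library(XML)

# Same input as in the question, including the inline <em> element
doc <- xmlParse("<sec><p>Text</p>
  <p><em>More</em> text<table>
   <tr><td>SKIP</td><td>this</td></tr>
  </table></p>
 </sec>")

# Hypothetical helper: text of one <p>, ignoring anything inside a <table>
paragraph_text <- function(p) {
  # .// keeps the search relative to this paragraph
  parts <- xpathSApply(p, ".//text()[not(ancestor::table)]", xmlValue)
  paste(parts, collapse = "")
}

res <- xpathSApply(doc, "//sec/p", paragraph_text)
res
# [1] "Text"      "More text"
```

This leaves doc unmodified and returns exactly one element per paragraph. One caveat: ancestor::table also matches tables above the paragraph itself, so a <p> that sits inside a table cell would come back empty and would need a narrower predicate.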