xml in R, remove paragraphs but keep xml class


I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here's some example text and my failed attempts:

library(xml2)
library(magrittr) # provides the %>% pipe used below
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # removes the <p> nodes, but their text goes with them
xml_find_all(text, './/caption//p') %>% xml_text() # returns the text as a character vector, losing the XML structure

Here's what I would like to end up with (just the paragraphs in the caption removed):

ideal_text = read_xml("<paper> <caption>The main title A sub title</caption> <p>The opening paragraph.</p> </paper>")
ideal_text

Solution

  • This takes several steps: find the caption node, copy the text of its paragraphs, remove the paragraph nodes, and then write the saved text back into the caption.

    library(xml2)
    library(magrittr)
    
    text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
    
    # find the caption
    caption <- xml_find_all(text, './/caption')
    
    # store the text of the caption's paragraphs as a single string
    replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")
    
    # remove the paragraph nodes from the caption
    caption %>% xml_find_all('.//p') %>% xml_remove()
    
    # write the stored text back into the caption
    xml_text(caption) <- replacement
    text  # print the document to check the result
        
    {xml_document}
    <paper>
       [1] <caption>The main title A sub title</caption>
       [2] <p>The opening paragraph.</p>
    

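    If the document contains more than one caption, the code above pastes the paragraph text from all captions into a single string and writes that same string into every caption. In that case you will most likely need to obtain the list of caption nodes and step through them one by one with a loop.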
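
The loop over caption nodes could look roughly like the sketch below. It is only an illustration under assumptions: the two-caption input document and the variable names (`doc`, `captions`, `paragraphs`) are made up for this example and do not come from the question. It uses only `xml_find_all()`, `xml_text()`, `xml_remove()` and the `xml_text<-` setter from xml2.

```r
library(xml2)

# Hypothetical input with two captions, used only for illustration.
doc <- read_xml("<paper> <caption><p>First title</p> <p>First sub title</p></caption> <p>The opening paragraph.</p> <caption><p>Second title</p></caption> </paper>")

# Collect every caption node in the document.
captions <- xml_find_all(doc, ".//caption")

# Handle each caption separately so that text from one caption
# never ends up in another.
for (i in seq_along(captions)) {
  caption <- captions[[i]]

  # Paragraphs that belong to this caption only.
  paragraphs <- xml_find_all(caption, ".//p")

  # Save their text as one string before the nodes are removed.
  replacement <- paste(xml_text(paragraphs), collapse = " ")

  # Drop the <p> nodes, then write the saved text into the caption.
  xml_remove(paragraphs)
  xml_text(caption) <- replacement
}

# Inspect the result: the captions keep their text, and the <p>
# outside the captions is left untouched.
xml_text(xml_find_all(doc, ".//caption"))
doc
```

Because xml2 nodes refer to the underlying document, the assignment inside the loop changes `doc` itself, so nothing needs to be copied back afterwards.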