
Parsing an XML with missing content


I have an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>

... with many missing elements, but I would like to obtain a data.frame with one row for each "div", like the following:

div time content
1 time1 content1
2 time2 NA
3 time3 content3
4 NA content4

with NA where the element is missing.

I tried an approach like this one:

data_xml <- read_xml(xmlfile)
div <-xml_find_all(data_xml, xpath = ".//div")
df <- tibble::tibble(
  date = div %>% xml_text(),
  content = div %>% xml_find_first('./p[@rend="content"/hi[@rend="italic"]]') %>% xml_text()
)

but xml_find_all returns an empty list. Following some suggestions, I tried this approach, which does work:

doc <- htmlParse(xmlfile)

div <- getNodeSet(doc, '//div')
dates<- xpathSApply(doc,'//div/text()',xmlValue)
abstracts<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))

I correctly obtained the strings I wanted, BUT I lost the correspondence between them, since many div elements have no content or no head with time information (so div, dates and abstracts have different lengths). Any suggestions? TIA


Solution

    1) The input shown is malformed (the third div is never closed), so read_xml will give an error. Since the question indicates that parsing works, there must have been a transcription error when copying the XML into the question. The Note at the end contains a corrected version, with a closing div tag added before the 4th opening div tag.

    The XML declares a default namespace, so un-prefixed XPath expressions such as //div match nothing; this is why xml_find_all returned an empty result in the question. First strip the namespace using xml_ns_strip to avoid that problem. Then form an xpath expression that selects the needed nodes and convert those to dcf format (a name:value format where each field is on a separate line and a blank line separates records; see ?read.dcf for details) in the variable dcf. Read that using read.dcf, convert the resulting character matrix to a data frame and fix up the div entries.

    library(dplyr)
    library(xml2)
    
    doc <- read_xml(Lines) %>% xml_ns_strip() # Lines in Note below
    
    nodes <- doc %>%
      xml_find_all('//div | //head[@rend="time"] | //hi[@rend="italic"]')
    
    # one dcf line per node; a blank line plus "div:" starts each record
    dcf <- case_match(xml_name(nodes),
      "div" ~ "\ndiv:",
      "head" ~ paste0("time:", xml_text(nodes)),
      .default = paste0("content:", xml_text(nodes))
    )
    
    dcf %>%
      textConnection() %>%
      read.dcf() %>%
      as.data.frame() %>%
      mutate(div = row_number())
    

    giving

      div   time  content
    1   1 TIME_1 CONTENT1
    2   2 TIME_2     <NA>
    3   3 TIME_3 CONTENT3
    4   4   <NA> CONTENT4
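
    To make the dcf step more concrete, here is a minimal sketch that uses only base R. The dcf_lines vector is made up for illustration (it is not generated from the XML above); it mimics what the dcf variable should roughly contain for two records, the second of which has no content field.

    ```r
    # Hypothetical dcf text: "name: value" lines, with a blank line between records
    dcf_lines <- c(
      "div:",
      "time: TIME_1",
      "content: CONTENT1",
      "",
      "div:",
      "time: TIME_2"
    )

    con <- textConnection(dcf_lines)
    m <- read.dcf(con)   # character matrix with one row per record
    close(con)

    # Fields that are absent from a record come back as NA
    as.data.frame(m)
    ```

    read.dcf uses the union of the field names as columns, so a record that lacks a field is padded with NA. That padding is what keeps time and content aligned with their div.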
    

    2) Another way is to call xml_find_all twice. The first call creates a node set of div elements and the second creates a list of node sets, with one component per div because flatten = FALSE. These are then reshaped into a data frame.

    library(purrr)
    doc %>%
      xml_find_all('//div') %>%
      xml_find_all(".//head | .//hi", flatten = FALSE) %>%
      map_df(~ setNames(xml_text(.x, TRUE), xml_name(.x))) %>%
      reframe(div = row_number(), time = head, content = hi)
    ## # A tibble: 4 × 3
    ##     div time   content 
    ##   <int> <chr>  <chr>   
    ## 1     1 TIME_1 CONTENT1
    ## 2     2 TIME_2 <NA>    
    ## 3     3 TIME_3 CONTENT3
    ## 4     4 <NA>   CONTENT4
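
    As a rough illustration of what flatten = FALSE changes, the sketch below compares the two modes on a small namespace-free document. The document and the names small, flat and nested are invented for this example.

    ```r
    library(xml2)

    # Small made-up document: the second div has no hi, the third has no head
    small <- read_xml("<body>
      <div><head>T1</head><hi>C1</hi></div>
      <div><head>T2</head></div>
      <div><hi>C3</hi></div>
    </body>")

    divs <- xml_find_all(small, "//div")

    # flatten = TRUE (the default) pools the matches from all divs into one node set
    flat <- xml_find_all(divs, ".//head | .//hi")

    # flatten = FALSE keeps one node set per div, so the grouping is preserved
    nested <- xml_find_all(divs, ".//head | .//hi", flatten = FALSE)

    length(flat)                         # total number of matched nodes
    vapply(nested, length, integer(1))   # number of matches inside each div
    ```

    With the pooled result there is no way to tell which div a node came from, which is the same loss of correspondence described in the question. The nested result keeps one component per div, even when a div has fewer matches.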
    

    3) This third alternative is a bit closer to the attempt in the question except it uses xml_find_first separately for each column.

    column <- function(start, xpath) {
      start %>% xml_find_first(xpath) %>% xml_text(TRUE)
    }
    
    div_nodes <- doc %>% xml_find_all('//div')
    tibble(div = seq_along(div_nodes),
           time = column(div_nodes, ".//head"),
           content = column(div_nodes, ".//hi")
    ) 
    ## # A tibble: 4 × 3
    ##     div time   content 
    ##   <int> <chr>  <chr>   
    ## 1     1 TIME_1 CONTENT1
    ## 2     2 TIME_2 <NA>    
    ## 3     3 TIME_3 CONTENT3
    ## 4     4 <NA>   CONTENT4
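
    If stripping the namespace is not desirable, a variant of this approach can keep it and use a prefixed XPath instead. This is only a sketch under assumptions: the sample document is an abbreviated, made-up version of the input, and it relies on xml2 exposing the default namespace under the prefix d1 via xml_ns.

    ```r
    library(xml2)

    # Abbreviated, made-up sample in the same TEI namespace as the question
    tei <- read_xml('<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
      <div rend="entry"><head rend="time">TIME_1</head>
        <p rend="content"><hi rend="italic"> CONTENT1 </hi></p></div>
      <div rend="entry"><head rend="time">TIME_2</head></div>
    </body></text></TEI>')

    ns <- xml_ns(tei)   # the default namespace is exposed under the prefix d1

    # Un-prefixed names match nothing while the namespace is in place
    length(xml_find_all(tei, "//div"))

    # Prefixed names match; xml_find_first returns one result per div
    divs <- xml_find_all(tei, "//d1:div", ns)
    res <- data.frame(
      div = seq_along(divs),
      time = xml_text(xml_find_first(divs, "./d1:head[@rend='time']", ns), trim = TRUE),
      content = xml_text(xml_find_first(divs, ".//d1:hi[@rend='italic']", ns), trim = TRUE)
    )
    res
    ```

    Because xml_find_first yields a missing node for a div without a match, xml_text returns NA in that position and the columns keep the same length as the set of div nodes.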
    

    Note

    Lines <- '<?xml version="1.0" encoding="UTF-8"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader> ... </teiHeader>
    <text>
    <body>
    <head rend="Body A">DOCUMENT_TITLE</head>
    <div rend="entry">
    <head rend="time">TIME_1</head>
    <p rend="Body A"> INFORMATION A</p>
    <p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
    </div>
    <div rend="entry">
    <head rend="time">TIME_2</head>
    <p rend="Body A"> INFORMATION A</p>
    <p rend="Body A"> INFORMATION A</p>
    </div>
    <div rend="entry">
    <head rend="time">TIME_3</head>
    <p rend="Body A"> INFORMATION A</p>
    <p rend="content">
    <hi rend="italic"> CONTENT3 </hi>
    </p>
    </div>
    <div rend="entry">
    <p rend="Body A"> INFORMATION A</p>
    <p rend="content">
    <hi rend="italic"> CONTENT4 </hi>
    </p>
    </div>
    </body>
    </text>
    </TEI>'