Parsing an XML with missing content

I have an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>

... with many missing elements, but I would like to obtain a data.frame with one row for each "div", like the following:

div	time	content
1	time1	content1
2	time2	NA
3	time3	content3
4	NA	content4

with NA wherever an element is missing.

I tried an approach like this:

data_xml <- read_xml(xmlfile)
div <-xml_find_all(data_xml, xpath = ".//div")
df <- tibble::tibble(
  date = div %>% xml_text(),
  content = div %>% xml_find_first('./p[@rend="content"/hi[@rend="italic"]]') %>% xml_text()
)

but xml_find_all returns an empty node set. Following some suggestions, I tried the following, which does work:

doc <- htmlParse(xmlfile)

div <- getNodeSet(doc, '//div')
dates<- xpathSApply(doc,'//div/text()',xmlValue)
abstracts<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))

I correctly obtained the strings I wanted, BUT I lost the correspondence between them, since many divs have no content or no head with time information (meaning that div, dates and abstracts have different lengths). Any suggestions? TIA

Solution

1) The input shown is malformed, so read_xml will give an error. Since the question indicates that reading it works, there must have been a transcription error when copying the XML into the question. In the Note at the end we have added a closing div tag before the 4th opening div tag.

Since the XML uses a namespace, first strip it with xml_ns_strip so that plain element names can be used in XPath expressions. Then form an XPath expression that selects the needed nodes in document order and convert them to dcf format, a name:value format in which each field is on a separate line and a blank line separates records (see ?read.dcf for details), stored in the variable dcf. Read that using read.dcf, which fills in NA for any field missing from a record, convert the resulting character matrix to a data frame, and replace the empty div entries with row numbers.

library(dplyr)
library(xml2)

doc <- read_xml(Lines) %>% xml_ns_strip() # Lines in Note below

nodes <- doc %>%
  xml_find_all('//div | //head[@rend="time"] | //hi[@rend="italic"]')

dcf <- case_match(xml_name(nodes),
  "div" ~ "\ndiv:",  # the leading newline starts a new record
  "head" ~ paste0("time:", xml_text(nodes)),
  .default = paste0("content:", xml_text(nodes))  # the remaining nodes are hi
)

dcf %>%
  textConnection() %>%
  read.dcf() %>%
  as.data.frame() %>%
  mutate(div = row_number())

giving

  div   time  content
1   1 TIME_1 CONTENT1
2   2 TIME_2     <NA>
3   3 TIME_3 CONTENT3
4   4   <NA> CONTENT4
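
To make the intermediate dcf text concrete, here is a minimal base R sketch using hand-written records. The vector txt and its values are made up for illustration and are not produced by the pipeline above. It is meant to show that read.dcf returns NA for any field a record lacks, which is where the NA entries in the result come from.

```r
# Hand-written dcf text (an assumption for illustration):
# one "name: value" field per line, an empty line between records.
txt <- c(
  "time: TIME_1",
  "content: CONTENT1",
  "",
  "time: TIME_2",       # this record has no content field
  "",
  "content: CONTENT4"   # this record has no time field
)

con <- textConnection(txt)
m <- read.dcf(con)      # character matrix, one row per record
close(con)

as.data.frame(m)
```

The second row should have NA in content and the third row NA in time, mirroring the rows for the 2nd and 4th div above.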

2) Another way is to use a double xml_find_all. The first creates a node set of div elements and the second, because flatten = FALSE, creates a list of node sets with one component per div. These are then reshaped into a data frame.

library(purrr)
doc %>%
  xml_find_all('//div') %>%
  xml_find_all(".//head | .//hi", flatten = FALSE) %>%
  map_df(~ setNames(xml_text(.x, TRUE), xml_name(.x))) %>%
  reframe(div = row_number(), time = head, content = hi)
## # A tibble: 4 × 3
##     div time   content 
##   <int> <chr>  <chr>   
## 1     1 TIME_1 CONTENT1
## 2     2 TIME_2 <NA>    
## 3     3 TIME_3 CONTENT3
## 4     4 <NA>   CONTENT4
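
The role of flatten = FALSE can be seen in isolation with a small sketch. The document doc2 below is invented for this illustration and has no namespace; it assumes a version of xml2 in which xml_find_all has the flatten argument, as used above.

```r
library(xml2)

# Small namespace-free document, made up for this sketch.
doc2 <- read_xml('<body>
  <div><head>T1</head><p><hi>C1</hi></p></div>
  <div><p><hi>C2</hi></p></div>
</body>')

divs <- xml_find_all(doc2, "//div")

# flatten = TRUE (the default) pools all matches into a single node set,
# so it is no longer known which div each match came from.
flat <- xml_find_all(divs, ".//head | .//hi")
length(flat)

# flatten = FALSE keeps one node set per div, preserving the correspondence.
per_div <- xml_find_all(divs, ".//head | .//hi", flatten = FALSE)
vapply(per_div, length, integer(1))
```

Here the first div contributes two nodes and the second only one, so the per-div list keeps track of which fields are absent while the flattened node set does not.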

3) This third alternative is a bit closer to the attempt in the question except it uses xml_find_first separately for each column.

column <- function(start, xpath) {
  start %>% xml_find_first(xpath) %>% xml_text(TRUE)
}

div_nodes <- doc %>% xml_find_all('//div')
tibble(div = seq_along(div_nodes),
       time = column(div_nodes, ".//head"),
       content = column(div_nodes, ".//hi")
) 
## # A tibble: 4 × 3
##     div time   content 
##   <int> <chr>  <chr>   
## 1     1 TIME_1 CONTENT1
## 2     2 TIME_2 <NA>    
## 3     3 TIME_3 CONTENT3
## 4     4 <NA>   CONTENT4
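
All three approaches depend on the namespace having been stripped, which is also the likely reason the xml_find_all call in the question found nothing. A minimal sketch of the effect, using a tiny namespaced document invented for this illustration (xml_txt and doc_ns are hypothetical names), together with a prefix-based alternative to stripping:

```r
library(xml2)

# Tiny document with a default namespace, made up for this sketch.
xml_txt <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>T1</head></div>
    <div><head>T2</head></div>
  </body></text>
</TEI>'

doc_ns <- read_xml(xml_txt)

# An unprefixed name in XPath refers to no namespace, so nothing matches.
n_plain <- length(xml_find_all(doc_ns, "//div"))

# xml2 labels the default namespace "d1"; see xml_ns(doc_ns).
n_prefixed <- length(xml_find_all(doc_ns, "//d1:div", xml_ns(doc_ns)))

# After stripping the default namespace the plain name matches.
doc_ns <- xml_ns_strip(doc_ns)
n_stripped <- length(xml_find_all(doc_ns, "//div"))

c(plain = n_plain, prefixed = n_prefixed, stripped = n_stripped)
```

Using the d1: prefix avoids modifying the document, at the cost of prefixing every element name in every XPath expression.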

Note

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
</div>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>'