I have an XML file like this:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>
... with many missing elements, but I would like to obtain a data.frame with one row for each "div", like the following:
div | time | content |
---|---|---|
1 | time1 | content1 |
2 | time2 | NA |
3 | time3 | content3 |
4 | NA | content4 |
with NA when the element is missing.
I tried an approach like this:
data_xml <- read_xml(xmlfile)
div <-xml_find_all(data_xml, xpath = ".//div")
df <- tibble::tibble(
date = div %>% xml_text(),
content = div %>% xml_find_first('./p[@rend="content"/hi[@rend="italic"]]') %>% xml_text()
)
but the xml_find_all call returns an empty list.
Following some suggestions, I tried this approach, which does work:
doc <- htmlParse(xmlfile)
div <- getNodeSet(doc, '//div')
dates<- xpathSApply(doc,'//div/text()',xmlValue)
abstracts<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))
I correctly obtained the strings I wanted, BUT I lost the correspondence between them, since many div elements have no content or no head with time information (meaning that div, dates and abstracts have different lengths). Any suggestions? TIA
1) The input shown is malformed, so read_xml will give an error. Since the question indicates that it works, there must have been a transcription error in moving the XML to the question. In the Note at the end, we have added a closing div tag before the 4th opening div tag.
Since the XML uses a namespace, first strip it using xml_ns_strip to avoid problems. Then form an xpath expression that selects the needed nodes and convert those to dcf format in the variable dcf. (dcf is a name:value format in which each field is on a separate line and a blank line separates records; see ?read.dcf for details.) Read that using read.dcf, convert the resulting character matrix to a data frame and fix up the div entries.
library(dplyr)
library(xml2)
doc <- read_xml(Lines) %>% xml_ns_strip() # Lines in Note below
nodes <- doc %>%
xml_find_all('//div | //head[@rend="time"] | //hi[@rend="italic"]')
# each div starts a new record; head nodes carry the time, hi nodes the content
dcf <- case_match(xml_name(nodes),
"div" ~ "\ndiv:",
"head" ~ paste0("time:", xml_text(nodes)),
.default = paste0("content:", xml_text(nodes))
)
dcf %>%
textConnection() %>%
read.dcf() %>%
as.data.frame() %>%
mutate(div = row_number())
giving
div time content
1 1 TIME_1 CONTENT1
2 2 TIME_2 <NA>
3 3 TIME_3 CONTENT3
4 4 <NA> CONTENT4
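To see why the dcf detour preserves the correspondence between fields, here is a minimal sketch of read.dcf on its own. The records vector is made up for illustration (it is not produced from the XML), and its second record deliberately has no time field.

```r
# Two hypothetical records in name:value form; "" is the blank line
# that separates records. The second record has no time field.
records <- c("div:", "time:TIME_1", "content:CONTENT1",
             "",
             "div:", "content:CONTENT2")

con <- textConnection(records)
m <- read.dcf(con)   # character matrix, one row per record
close(con)

# Fields absent from a record are filled with NA in that row
m
```

Because every record becomes one row and absent fields become NA, the rows stay aligned with the div elements regardless of which children each div has.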
2) Another way is to use xml_find_all twice. The first call creates a node set and the second creates a list of node sets, with one component per record because flatten = FALSE. These are then reshaped into a data frame.
library(purrr)
doc %>%
xml_find_all('//div') %>%
xml_find_all(".//head | .//hi", flatten = FALSE) %>%
map_df(~ setNames(xml_text(.x, TRUE), xml_name(.x))) %>%
reframe(div = row_number(), time = head, content = hi)
## # A tibble: 4 × 3
## div time content
## <int> <chr> <chr>
## 1 1 TIME_1 CONTENT1
## 2 2 TIME_2 <NA>
## 3 3 TIME_3 CONTENT3
## 4 4 <NA> CONTENT4
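The intermediate result of the second xml_find_all can be inspected with a small sketch like the following. The two-entry document demo is a hypothetical stand-in without a namespace, and it assumes a version of xml2 whose xml_find_all accepts flatten, as the code above does.

```r
library(xml2)

# Hypothetical two-entry document; the second div has no hi element
demo <- read_xml('<body>
<div><head>TIME_1</head><p><hi>CONTENT1</hi></p></div>
<div><head>TIME_2</head></div>
</body>')

div_set <- xml_find_all(demo, "//div")

# With flatten = FALSE the matches are kept grouped by the div they came from
per_div <- xml_find_all(div_set, ".//head | .//hi", flatten = FALSE)

length(per_div)                               # one list component per div
counts <- vapply(per_div, length, integer(1)) # matches found inside each div
counts
```

With the default flatten = TRUE the same call would return a single node set of all matches, and the information about which div each match belongs to would be lost.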
3) This third alternative is a bit closer to the attempt in the question, except that it uses xml_find_first separately for each column.
column <- function(start, xpath) {
start %>% xml_find_first(xpath) %>% xml_text(TRUE)
}
div_nodes <- doc %>% xml_find_all('//div')
tibble(div = seq_along(div_nodes),
time = column(div_nodes, ".//head"),
content = column(div_nodes, ".//hi")
)
## # A tibble: 4 × 3
## div time content
## <int> <chr> <chr>
## 1 1 TIME_1 CONTENT1
## 2 2 TIME_2 <NA>
## 3 3 TIME_3 CONTENT3
## 4 4 <NA> CONTENT4
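The same per-column idea should also work without stripping the namespace, which is closer still to the attempt in the question. The sketch below is only an illustration under assumptions: tei is a hypothetical two-entry stand-in for the real file, and it relies on xml2 exposing a default namespace under the prefix d1 (check xml_ns on your own document).

```r
library(xml2)

# Hypothetical stand-in with the same default namespace as the TEI file
tei <- read_xml('<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div rend="entry"><head rend="time">TIME_1</head></div>
<div rend="entry"><p rend="content"><hi rend="italic"> CONTENT2 </hi></p></div>
</body></text></TEI>')

ns <- xml_ns(tei)  # the default namespace is listed under the prefix d1

# Unprefixed names do not match namespaced elements, hence the empty result
n_plain <- length(xml_find_all(tei, ".//div", ns))

# Prefixed names do match
divs <- xml_find_all(tei, ".//d1:div", ns)

# xml_find_first returns one result per div; a div without a match
# yields a missing node, which xml_text turns into NA
res <- data.frame(
  div = seq_along(divs),
  time = xml_text(xml_find_first(divs, './d1:head[@rend="time"]', ns), trim = TRUE),
  content = xml_text(
    xml_find_first(divs, './d1:p[@rend="content"]/d1:hi[@rend="italic"]', ns),
    trim = TRUE
  ),
  stringsAsFactors = FALSE
)
res
```

Note that the predicate brackets close right after each attribute test; in the question's xpath the closing bracket of p[...] was misplaced, which is a separate problem from the namespace.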
Note
Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
</div>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>'