Compile NodeSet based on a condition (xml2)

I am trying to select a nodeset and extract text from a child node. However, the source XML does not have a rigid structure. An item can be one of two types, and the text to extract sits in a different child node for each type. A simplified example is below.

<p_item>
    <id>id1</id>
</p_item>
<e_item>
    <e_id>id2</e_id>
</e_item>
<p_item>
    <id>id3</id>
    <e_id>id3</e_id>
</p_item>

Some p_items contain both id and e_id. If I select all items (p_item + e_item), I get two ids for some p_items. I want a single id per item, so that I can bind the resulting character vectors into a dataframe. I would like to use pipe semantics, loop over the items, and compile the nodeset as follows:

- if the item is a p_item, extract id
- if the item is an e_item, extract e_id
- if a p_item has both id and e_id, extract id only

I was not able to figure out how to use purrr::map to compile the nodeset. In the last step I want to use

xml_find_all("id | e_id") %>%
xml_text()

and bind the same-length character vectors into a final dataframe. Has anybody had experience with a similar problem? Thank you for sharing your knowledge.

Solution

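Not sure where purrr comes into the question as currently stated. You can write your requirement as a CSS selector list, using type selectors for the elements of interest and a combinator to specify the relationship between them, e.g. the descendant combinator (a space). The , in the selector list allows for OR selection, where either the left or the right pattern can be matched. Here 'p_item id' matches id inside a p_item and 'e_item e_id' matches e_id inside an e_item, so an e_id inside a p_item is never selected.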

library(rvest)

html <- '<p_item>
    <id>id1</id>
</p_item>
<e_item>
    <e_id>id2</e_id>
</e_item>
<p_item>
    <id>id3</id>
    <e_id>id3</e_id>
</p_item>'

page <- read_html(html)

page |> html_elements('p_item id, e_item e_id') |> html_text()
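
For the xml2 functions mentioned in the question, roughly the same selection can be sketched with an XPath union. This is only a sketch under one assumption: the fragment is wrapped in a single root element (the <root> wrapper below is an illustrative addition, since read_xml() needs one root).

```r
library(xml2)

# Hypothetical <root> wrapper: read_xml() requires a single root element
doc <- read_xml('<root>
<p_item><id>id1</id></p_item>
<e_item><e_id>id2</e_id></e_item>
<p_item><id>id3</id><e_id>id3</e_id></p_item>
</root>')

# Union of: id children of p_item, and e_id children of e_item.
# An e_id inside a p_item matches neither branch, so it is skipped.
ids <- xml_text(xml_find_all(doc, "//p_item/id | //e_item/e_id"))
```

The | in XPath plays the same role as the , in the CSS selector list.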

I suppose you might use purrr if you want to deal with potentially missing child nodes, so that every item yields exactly one value (NA when no matching child exists), e.g.

html2 <- '
<p_item>
    <id>id1</id>
</p_item>
<p_item>
    <unknown>not_me</unknown>
</p_item>
<e_item>
    <e_id>id2</e_id>
</e_item>
<p_item>
    <id>id3</id>
    <e_id>id3</e_id>
</p_item>'

library(purrr)

page2 <- read_html(html2)

purrr::map_chr(page2 |> html_elements('p_item, e_item'), ~ .x |> html_element('id, e_id') |> html_text())

But no additional libraries are needed in this case. You could use an *apply function, e.g.

sapply(page2 |> html_elements('p_item, e_item'), function(x) html_element(x, 'id, e_id') |> html_text())
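
Since the question ultimately wants equal-length vectors for a dataframe, one possible sketch skips the explicit loop altogether: html_element() applied to a node set returns one result per input node (a missing node where nothing matches), so the text stays aligned with the items. The column names type and id are illustrative choices, not something from the question. Looking up id first and falling back to e_id is an assumption about the intended precedence; it makes the "prefer id" rule independent of the order of the children.

```r
library(rvest)

page2 <- read_html('
<p_item><id>id1</id></p_item>
<p_item><unknown>not_me</unknown></p_item>
<e_item><e_id>id2</e_id></e_item>
<p_item><id>id3</id><e_id>id3</e_id></p_item>')

items <- html_elements(page2, 'p_item, e_item')

# One value per item; NA where the child node is absent
id   <- html_text(html_element(items, 'id'))
e_id <- html_text(html_element(items, 'e_id'))

# Illustrative column names; prefer id, fall back to e_id
result <- data.frame(
  type = html_name(items),
  id   = ifelse(is.na(id), e_id, id)
)
```

Note that the fallback would also pick up e_id from a p_item that has no id; if that should give NA instead, the fallback can be restricted to rows where type is "e_item".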