Consider this simple example:
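library(xml2)      # xml_find_all() is called unqualified below
library(magrittr)  # provides the %>% pipe
example_xml <- '<?xml version="1.0" encoding="UTF-8"?>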
<file>
<book>
<text>abracadabra</text>
<node></node>
</book>
<book>
<text>hello world</text>
<node></node>
</book>
</file>'
myxml <- xml2::read_xml(example_xml)
Now, running this works as expected:
> myxml %>% xml_find_all('//book')
{xml_nodeset (2)}
[1] <book>\n <text>abracadabra</text>\n <node/>\n</book>
[2] <book>\n <text>hello world</text>\n <node/>\n</book>
but looking for nodes whose text attribute contains wor does not:
> myxml %>% xml_find_all('//book[contains(@text, "wor")]')
{xml_nodeset (0)}
What is the problem here? How can I use regex (or partial string matching) with xml2?
Thanks!
The //book[contains(@text, "wor")] XPath finds book nodes that have a text attribute (@ specifies an attribute) whose value contains wor.
Your XML does not contain elements like <book text="Hello world">Title</book>, so there are no results.
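To make the attribute/element distinction concrete, here is a minimal sketch using a hypothetical document (attr_xml is not part of the question) in which text really is an attribute of book. With that shape, the original @text query does match:

```r
library(xml2)

# Hypothetical variant of the data: "text" is an attribute here,
# not a child element as in the question.
attr_xml <- read_xml(
  '<file>
     <book text="abracadabra"><title>A</title></book>
     <book text="hello world"><title>B</title></book>
   </file>'
)

# @text addresses the attribute, so contains() now has something to test.
hits <- xml_find_all(attr_xml, '//book[contains(@text, "wor")]')

length(hits)            # 1
xml_attr(hits, "text")  # "hello world"
```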
To get the book nodes whose string value (the text of all their descendants) contains wor, use
> xml_find_all(myxml, '//book[contains(., "wor")]')
{xml_nodeset (1)}
[1] <book>\n <text>hello world</text>\n <node/>\n</book>
If you are fine with just the text elements as the return values, you may use
> xml_find_all(myxml, '//book/text[contains(., "wor")]')
{xml_nodeset (1)}
[1] <text>hello world</text>
If you need to get all book parents that have any child element with wor in its text, use
> xml_find_all(myxml, '//*[contains(., "wor")]/parent::book')
{xml_nodeset (1)}
[1] <book>\n <text>hello world</text>\n <node/>\n</book>
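As a side note, the same selection can probably be written with a nested predicate instead of the parent:: axis, which some find easier to read. A small sketch on a compact copy of the question's data (the variable names are only illustrative):

```r
library(xml2)

doc <- read_xml(
  '<file>
     <book><text>abracadabra</text><node></node></book>
     <book><text>hello world</text><node></node></book>
   </file>'
)

# Step down to any element containing "wor", then back up to a book parent.
via_parent <- xml_find_all(doc, '//*[contains(., "wor")]/parent::book')

# Select book directly and test its child elements in a nested predicate.
via_predicate <- xml_find_all(doc, '//book[*[contains(., "wor")]]')

# Both forms select the same single node in this document.
identical(xml_path(via_parent), xml_path(via_predicate))  # TRUE
```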
See this answer to learn more about the difference between text() and . (the context node). In short, [contains(., "wor")] returns true if the string value of an element, that is, the concatenated text of all its descendant text nodes, contains wor.
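The sketch below illustrates that difference on a compact copy of the question's data, and also touches on the regex part of the question. xml2 evaluates XPath through libxml2, which implements XPath 1.0, so contains() and starts-with() are available but there is no regex function such as matches(). One workaround, offered as a suggestion rather than the only option, is to select the candidate nodes with XPath and filter them in R with grepl(); the pattern used here is just an example.

```r
library(xml2)

doc <- read_xml(
  '<file>
     <book><text>abracadabra</text><node></node></book>
     <book><text>hello world</text><node></node></book>
   </file>'
)

# "." is the string value of book, which includes the text of <text>.
by_string_value <- xml_find_all(doc, '//book[contains(., "wor")]')

# text() only sees text nodes that are direct children of book;
# here those are whitespace only, so nothing matches.
by_text_fn <- xml_find_all(doc, '//book[contains(text(), "wor")]')

length(by_string_value)  # 1
length(by_text_fn)       # 0

# Regex matching: pull the text into R and filter the node set there.
books <- xml_find_all(doc, "//book")
regex_hits <- books[grepl("w[a-z]+d$", trimws(xml_text(books)))]
trimws(xml_text(regex_hits))  # "hello world"
```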