xpath rvest xml2

Using rvest::html_element with XPath on an HTML document with embedded XML


I am trying to traverse an HTML/XML document scraped from the SEC website. It's a company filing (form 10-Q). The document has XML tags and attributes, but it is an HTML document. If I read the document as HTML, then XPath does not work when looking up non-HTML attributes (e.g. contextRef). If I read the document as XML, then I can look up non-HTML attributes, but I can't traverse the document using an explicit path (e.g. /html/body/div).

Here's a working example.

req <- httr2::request("https://www.sec.gov/Archives/edgar/data/1065280/000106528022000368/nflx-20220930.htm") |>
  httr2::req_headers(`User-Agent` = "Me [email protected]")
resp <- httr2::req_perform(req)
xml_doc <- httr2::resp_body_xml(resp, check_type = FALSE)
html_doc <- httr2::resp_body_html(resp)

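# Parsed as XML: the explicit path finds nothing, the attribute lookup works
rvest::html_element(xml_doc, xpath = "/html/body/div")
rvest::html_element(xml_doc, xpath = "//*[@contextRef]")

# Parsed as HTML: the explicit path works, the attribute lookup finds nothing
rvest::html_element(html_doc, xpath = "/html/body/div")
rvest::html_element(html_doc, xpath = "//*[@contextRef]")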

Any recommendations that would allow me to both traverse the explicit path and search for attributes would be helpful.


Solution

  • It's actually an XHTML document. It's best to treat it as XML, but you will need to recognize that elements such as "html" and "body" are in the XHTML namespace and prefix them accordingly in your XPath expressions, for example /h:html/h:body/h:p, where the prefix h is bound to the namespace URI http://www.w3.org/1999/xhtml.

    (I don't know the rvest API so I can't advise you how to set up the namespace bindings.)
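
As a rough illustration of the namespace binding described above, here is a minimal sketch that uses xml2 directly, since its xml_find_first() and xml_find_all() take an ns argument. The inline XHTML string is a made-up stand-in for the filing, and the variable names (xhtml, doc, ns, div, facts) and the prefix h are arbitrary choices, not anything taken from the SEC document.

```r
library(xml2)

# Hypothetical stand-in for the filing: XHTML with one inline XBRL element.
xhtml <- '<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <div><ix:nonFraction contextRef="c-1" name="us-gaap:Revenues">100</ix:nonFraction></div>
  </body>
</html>'

doc <- read_xml(xhtml)

# Bind the prefix "h" (an arbitrary choice) to the XHTML namespace URI.
ns <- c(h = "http://www.w3.org/1999/xhtml")

# Explicit path: every XHTML element name carries the h: prefix.
div <- xml_find_first(doc, "/h:html/h:body/h:div", ns = ns)

# Attribute lookup: contextRef has no namespace, so no prefix is needed.
facts <- xml_find_all(doc, "//*[@contextRef]")
xml_attr(facts, "contextRef")
```

With the real filing, the same ns vector should be usable on the object returned by httr2::resp_body_xml(), assuming the root element declares the XHTML namespace as its default. If namespace-qualified names are not needed at all, xml2::xml_ns_strip() is another option worth trying, after which un-prefixed paths such as /html/body/div may work on the XML parse.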