
Why do I get different results with XPath 1.0 in RCurl vs httr when using substring-before in an expression?


When I use XPath 1.0's substring-before or substring-after in an expression, my subsequent xmlValue call throws an error. The code below shows that the XPath expression works fine with httr but does not work with RCurl.

require(XML)
require(httr)
doc <- htmlTreeParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp", useInternalNodes = TRUE)
(string <- xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')", xmlValue, trim = TRUE))


require(RCurl)
# Note: GET() and content() below are httr functions (httr is loaded above)
fetch <- GET("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
contents <- content(fetch)
locsnodes <- getNodeSet(contents, "//div[@id = 'contactInformation']//p")  
sapply(locsnodes, xmlValue)

[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n Phone: 432-897-1440\r\n Toll Free: 866-721-6665\r\n Fax: 432-682-3672"

The code above works OK, but I want to use substring-before to clean up the result, like this:

[1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "

locsnodes <- getNodeSet(contents, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")  
sapply(locsnodes, xmlValue)

Error in UseMethod("xmlValue") : 
  no applicable method for 'xmlValue' applied to an object of class "character"

How can I use the substring functions together with RCurl? RCurl is the package chosen for a more complicated operation later on.

Thank you for any guidance (or a better way to achieve what I want).


Solution

  • The fun argument of xpathSApply (or indeed getNodeSet) is only called when the XPath expression returns a node set. In your case the expression returns a character string, so the function is ignored:

    require(XML)
    require(RCurl)
    doc <- htmlParse("http://www.cottonbledsoe.com/CM/Custom/TOCContactUs.asp")
    locsnodes <- getNodeSet(doc
                            , "substring-before(//div[@id = 'contactInformation']//p, 'Phone')")  
    > locsnodes
    [1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
    
    > str(locsnodes)
     chr "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
    

    The fun argument is not being used here in xpathSApply

    > xpathSApply(doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
    +             , function(x){1}
    + )
    [1] "500 West Illinois, Suite 300\r\n Midland, Texas 79701\r\n "
    

    because your XPath expression does not return a node set.
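
As a minimal sketch of how this could be applied: since substring-before already yields a character string, the result can be used directly without xmlValue. Alternatively, select the p nodes first and trim the text in R. The inline HTML below is a made-up stand-in for the contact page so the example runs offline, and the variable names are only illustrative. With RCurl, the same calls should apply to a document built from the downloaded text, for example htmlParse(getURL(url), asText = TRUE).

```r
library(XML)

# Hypothetical stand-in for the contact page, so the sketch runs offline
html <- paste0(
  "<html><body><div id='contactInformation'>",
  "<p>500 West Illinois, Suite 300\n Midland, Texas 79701\n",
  " Phone: 432-897-1440\n Fax: 432-682-3672</p>",
  "</div></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Option 1: the XPath string function returns a character string,
# so use it directly instead of passing it to xmlValue
address1 <- getNodeSet(
  doc, "substring-before(//div[@id = 'contactInformation']//p, 'Phone')"
)

# Option 2: select the nodes, extract their text, then trim in R
nodes <- getNodeSet(doc, "//div[@id = 'contactInformation']//p")
txt <- sapply(nodes, xmlValue)
address2 <- vapply(
  strsplit(txt, "Phone", fixed = TRUE),  # split each string at the literal "Phone"
  function(parts) parts[1],              # keep only the text before it
  character(1)
)

# Strip the leftover whitespace and line breaks at the ends
trimws(address1)
trimws(address2)
```

Option 2 keeps the node set, so xmlValue still applies, and it handles several matching p nodes, whereas substring-before only looks at the first node of the set.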