Tags: html, r, screen-scraping, html-content-extraction

How can I read and parse the contents of a webpage in R


I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.


Solution

  • Not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions over HTML, so you will definitely want to parse it with the XML package.

    Here's an example to get you started:

    require(RCurl)
    require(XML)

    # download the raw HTML as a single string
    webpage <- getURL("http://www.haaretz.com/")
    # split the string into a character vector of lines
    webpage <- readLines(tc <- textConnection(webpage)); close(tc)
    # parse into an HTML tree, silently ignoring malformed-markup errors
    pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
    # extract the text content of every table node
    x <- xpathSApply(pagetree, "//*/table", xmlValue)
    # do some clean up with regular expressions
    x <- unlist(strsplit(x, "\n"))                # split on newlines
    x <- gsub("\t", "", x)                        # drop tab characters
    x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)  # trim whitespace
    x <- x[!(x %in% c("", "|"))]                  # drop empty and separator entries
    

    This results in a character vector of mostly just webpage text (along with some JavaScript):

    > head(x)
    [1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"           
    [4] "  Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()"
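    On more recent versions of R, the same table-text extraction can be sketched with the rvest package instead of RCurl/XML. This is an alternative to the approach above, not part of the original answer; it assumes rvest 1.0+ is installed and R 4.1+ for the native pipe:

    ```r
    library(rvest)

    # download and parse the page in one step
    page <- read_html("http://www.haaretz.com/")

    # extract the text of every table node, as in the XML-based example
    x <- page |>
      html_elements("table") |>
      html_text2()   # html_text2() trims and normalizes whitespace for us

    head(x)
    ```

    Here html_text2() handles most of the whitespace cleanup that the regular-expression steps above do by hand.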