
Web scraping using RCurl


I want to find the 10 most frequent posters to the R-help mailing list for January 2014. I have used getURL() to retrieve the data from the ETHZ secure site.

    library("RCurl")
    library("XML")
    jan14 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html",
                    ssl.verifypeer = FALSE)

 1) How can I parse jan14 using htmlTreeParse()?
 2) How can I use regular expressions to pull out the author lines and delete unwanted characters from them?

Solution

  • Retrieve the file. We must use getURL() because the URL scheme is https; otherwise we could have passed the URL to htmlParse() directly, as doc <- htmlParse(url).

    url <- "https://stat.ethz.ch/pipermail/r-help/2009-January/date.html"
    jan14 <- getURL(url, ssl.verifypeer = FALSE)
    

    htmlParse() parses the text that we have just retrieved. It is equivalent to htmlTreeParse() with useInternalNodes = TRUE, which is what the XPath query below needs, and it is easier to type.

    doc <- htmlParse(jan14, asText=TRUE)
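
    Since the question asks about htmlTreeParse() specifically, here is a minimal sketch of the two calls side by side. The HTML string is a made-up stand-in for the retrieved page, not the real archive, and the variable names are illustrative only.

```r
library(XML)

# Hypothetical one-item page standing in for the text returned by getURL().
page <- "<html><body><ul><li><i>Some Author</i></li></ul></body></html>"

# htmlParse() returns an internal (C-level) document by default.
a <- htmlParse(page, asText = TRUE)

# htmlTreeParse() returns an R-level tree unless useInternalNodes = TRUE.
b <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

# Both objects can now be queried with XPath.
is_internal <- c(inherits(a, "XMLInternalDocument"),
                 inherits(b, "XMLInternalDocument"))
```

    Under these assumptions both objects are internal documents, so either call could be used in place of the other for the steps that follow.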
    

    We do not need a regular expression to extract the authors from the HTML; that would be error-prone and difficult. Instead we use XPath to select the text of italicized (<i>) elements inside list items (<li>); this is where the author names appear in the HTML.

    who <- sapply(doc["//li/i/text()"], xmlValue)
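
    To see what the XPath expression selects without touching the network, here is a small sketch on an inline snippet. The snippet is a hypothetical stand-in shaped like the list items the query targets, and the author names are invented. xpathSApply() is used here as a one-call alternative to sapply(doc[...], xmlValue).

```r
library(XML)

# Hypothetical archive fragment: each <li> holds a subject link and an
# italicized author name followed by a newline.
html <- paste0(
  "<html><body><ul>",
  "<li><a href='000001.html'>[R] first subject</a> <i>Author One\n</i></li>",
  "<li><a href='000002.html'>[R] second subject</a> <i>Author Two\n</i></li>",
  "</ul></body></html>"
)

toy <- htmlParse(html, asText = TRUE)

# Text nodes of <i> elements that are direct children of <li> elements.
toy_who <- xpathSApply(toy, "//li/i/text()", xmlValue)

# The subject links are not matched because they are not inside <i>.
n_authors <- length(toy_who)
```

    In this sketch toy_who has one element per list item, each still carrying the trailing newline from the markup, which is what the cleanup step deals with.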
    

    who is a character vector of contributor names; the only 'unwanted' characters are white space characters (including new lines) at the end of each element. A regular expression matching one or more white space characters at the end of a string is [[:space:]]+$; we can use sub() to replace the match in each element with nothing (""). The table() function counts how many times each author contributed. sort() orders these counts from least to most frequent contributor. tail() returns the last entries (6 by default; we ask for 10).

    tail(sort(table(sub("[[:space:]]+$", "", who))), 10)
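
    The same pipeline can be unpacked one step at a time. This is a sketch on invented names (not taken from the archive), using base R only, so that each intermediate result can be inspected.

```r
# Hypothetical author strings with the kind of trailing white space
# left over from the HTML.
who_toy <- c("Ann Example\n", "Bob Sample \n", "Ann Example\n",
             "Cy Demo\n", "Ann Example\n", "Bob Sample\n")

clean  <- sub("[[:space:]]+$", "", who_toy)  # strip trailing white space
counts <- table(clean)                       # number of posts per author
ranked <- sort(counts)                       # ascending by count
top2   <- tail(ranked, 2)                    # the two most frequent posters

print(top2)
```

    Here "Bob Sample \n" and "Bob Sample\n" collapse to the same name once the trailing white space is removed, so top2 lists Bob Sample (2) followed by Ann Example (3).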
    

    The result is

    > tail(sort(table(sub("[[:space:]]+$", "", who))), 10)
    
               Greg Snow Henrique Dallazuanna       hadley wickham 
                      35                   36                   40 
           Marc Schwartz    Wacek Kusnierczyk          jim holtman 
                      48                   55                   80 
          Duncan Murdoch    Prof Brian Ripley      David Winsemius 
                      84                   84                   93 
      Gabor Grothendieck 
                     116