Suppose we want to know the 10 most frequent posters to the R-help mailing list for January 2014. I have used getURL() to retrieve the data from the ETHZ secure site:
library("RCurl")
library("XML")
jan14 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html", ssl.verifypeer = FALSE)
1) How can I parse the jan14 text using htmlTreeParse()?
2) How can I use regular expressions to pull out the author lines and delete unwanted characters from those lines?
Retrieve the file. We must use getURL() because the URL scheme is https:; otherwise we could have called doc <- htmlParse(url) directly.
url <- "https://stat.ethz.ch/pipermail/r-help/2009-January/date.html"
jan14 <- getURL(url, ssl.verifypeer = FALSE)
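htmlParse() parses the text that we have just retrieved; asText=TRUE tells it that the argument is the document content rather than a file name or URL. It is equivalent to htmlTreeParse() with useInternalNodes=TRUE, which XPath queries on the parsed document require, and it is easier to type.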
doc <- htmlParse(jan14, asText=TRUE)
We do not need a regular expression to parse the HTML; this would be error-prone and difficult. Instead we use XPath to select the text inside italic (<i>) elements within list items (<li>); this is where the author names appear in the HTML.
who <- sapply(doc["//li/i/text()"], xmlValue)
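As a side sketch, the same XPath query can be tried offline on a small hand-written page. The markup below is only an assumed approximation of the archive's list structure, and the names toy_html, toy_doc and toy_who, as well as the authors, are made up for illustration.

```r
library(XML)

# Hypothetical stand-in for the archive page: each <li> holds a subject link
# followed by the author's name in <i>, ending with a newline.
toy_html <- paste0(
  "<html><body><ul>",
  "<li><a href='000001.html'>[R] first subject</a>\n<i>Alice Example\n</i></li>",
  "<li><a href='000002.html'>[R] second subject</a>\n<i>Bob Sample\n</i></li>",
  "<li><a href='000003.html'>[R] Re: first subject</a>\n<i>Alice Example\n</i></li>",
  "</ul></body></html>"
)

# asText = TRUE: the string is the document itself, not a file name or URL.
toy_doc <- htmlParse(toy_html, asText = TRUE)

# Select the text nodes inside <i> elements that are children of <li> elements.
toy_who <- sapply(toy_doc["//li/i/text()"], xmlValue)
print(toy_who)
```

Each element of toy_who should carry the trailing newline from the markup, which is the kind of white space that still has to be trimmed from the real data.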
who is a character vector of contributor names; the only 'unwanted' characters are white space characters (including new lines) at the end of each element. A regular expression matching one or more white space characters at the end of a string is [[:space:]]+$; we can use sub() to replace each match with nothing (""). The table() function creates a table that counts how many times each author contributed. sort() takes this result and orders the counts from least to most frequent contributor. tail() returns the last entries (6 by default; we specify 10).
tail(sort(table(sub("[[:space:]]+$", "", who))), 10)
The result is
> tail(sort(table(sub("[[:space:]]+$", "", who))), 10)

           Greg Snow Henrique Dallazuanna       hadley wickham
                  35                   36                   40
       Marc Schwartz    Wacek Kusnierczyk          jim holtman
                  48                   55                   80
      Duncan Murdoch    Prof Brian Ripley      David Winsemius
                  84                   84                   93
  Gabor Grothendieck
                 116
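To see what each step of that one-liner does, here is a minimal sketch in base R on a made-up vector. The names and counts are invented for illustration, and the intermediate variables (clean, counts, top2) are not part of the original answer.

```r
# Made-up author strings with trailing spaces and newlines, as they might
# come out of the HTML.
who <- c("Alice Example\n", "Bob Sample \n", "Alice Example\n",
         "Carol Demo\n", "Alice Example  ", "Bob Sample\n")

# Step 1: strip one or more white space characters at the end of each string.
clean <- sub("[[:space:]]+$", "", who)

# Step 2: count posts per author, then sort from least to most frequent.
counts <- sort(table(clean))

# Step 3: keep the last n entries, i.e. the n most frequent posters.
top2 <- tail(counts, 2)
print(top2)
```

Without the trimming step, "Bob Sample \n" and "Bob Sample\n" would be counted as two different authors, which is why sub() has to run before table().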