
How to convert an a href into a full URL using R?


How can I turn an href value into a meaningful URL using R? By meaningful I mean an address that opens correctly when pasted into a browser.

For example:

<a href="../../systemfit/html/systemfit.html">systemfit</a>

read from: http://artax.karlin.mff.cuni.cz/r-help/library/systemfit/html/systemfit.control.html

into: http://artax.karlin.mff.cuni.cz/r-help/library/systemfit/html/systemfit.html

What I do is:

library(stringi)

# Return the href values of all <a> tags found on the page at URL x
collectLinks <- function(x) {
  html <- paste(readLines(x, warn = FALSE), collapse = "\n")
  matched <- stri_match_all_regex(html, "<a href=\"(.*?)\"")
  matched[[1]][, 2]
}

links <- collectLinks("http://artax.karlin.mff.cuni.cz/r-help/library/systemfit/html/systemfit.control.html")

The function collectLinks takes a character string containing a URL as input. It returns a character vector of the href values found on the page at x.

What I would like to do next is go through every element of links and extract the href content from each linked page in turn. However:

[1] "../../systemfit/html/systemfit.html"      "../../systemfit/html/solve.html"      
[3] "../../systemfit/html/det.html"         "../../systemfit/html/systemfit.html"  
[5] "mailto:arne.henningsen@googlemail.com" "../../systemfit/html/systemfit.html"  
[7] "00Index.html"  

are not meaningful URLs.

readLines(links[1])
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
cannot open file '../../systemfit/html/systemfit.html': No such file or directory

Is there a universal way to convert href content into a meaningful URL that can then be used for further processing?


Solution

  • library(XML)
    k1 <- getHTMLLinks("http://artax.karlin.mff.cuni.cz/r-help/library/systemfit/html/systemfit.control.html")
    # k1[6] is what you are looking for:
    > k1[6]
    [1] "../../systemfit/html/systemfit.html"
    # fixed = TRUE makes sub() treat "../.." literally instead of as a regex
    k2 <- htmlParse(sub("../..", "http://artax.karlin.mff.cuni.cz/r-help/library", k1[6], fixed = TRUE))
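
The substitution above only works for hrefs that start with "../..". A more general option may be to resolve each href against the URL of the page it was read from. The sketch below is not part of the original answer: it assumes the xml2 package is installed and uses its url_absolute() function, and the names base_url, abs_links and web_links are made up for illustration. The links vector is a shortened copy of the output shown in the question.

```r
# Sketch: resolve relative hrefs against the URL of the page they came from.
# Assumption: the xml2 package is available (it is not used in the original post).
library(xml2)

# The page the hrefs were collected from
base_url <- "http://artax.karlin.mff.cuni.cz/r-help/library/systemfit/html/systemfit.control.html"

# A few of the hrefs returned by collectLinks() in the question
links <- c("../../systemfit/html/systemfit.html",
           "mailto:arne.henningsen@googlemail.com",
           "00Index.html")

# Resolve every href relative to the page URL
abs_links <- url_absolute(links, base_url)

# Keep only links that can be fetched over HTTP(S); this drops mailto: and similar
web_links <- abs_links[grepl("^https?://", abs_links)]
print(web_links)
```

Each element of web_links can then be passed to readLines() or collectLinks(), provided the target page is still reachable. Filtering on the scheme is a design choice for this sketch; other schemes may need their own handling.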