read_html() usually returns all the page HTML for a given URL.
But when I try on this url, I can see that not all of the page is returned.
Why is this (and more importantly, how do I fix it)?
library(rvest)

page_html <- "https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R" %>%
  read_html()

page_html %>% html_text() %>% cat()
# We can see that not all of the page HTML has been retrieved

# And just to be sure
page_html %>% as.character()
I also tried the Nokogiri library (Ruby). It gives exactly the same result as read_html(), so it looks like this is not specific to R or read_html().
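It looks like the parser is treating the assignment operator (<-) in the page as the start of an unclosed tag, so everything after it is dropped. Both xml2 (which provides read_html()) and Nokogiri wrap libxml2, which would explain why they give the same result.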
fakepage <- "<html>the text after <- will be lost</html>"
read_html(fakepage) %>%
html_text()
[1] "the text after "
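One way to check this explanation is to escape the < as &lt; and see whether the text survives. The following is a minimal sketch, assuming rvest is installed; the two strings are made-up examples, and the behaviour of the unescaped version may vary with the version of the underlying parser.

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical test strings, not taken from the original page
raw_page     <- "<html>the text after <- will be lost</html>"
escaped_page <- "<html>the text after &lt;- will be kept</html>"

# Unescaped: "<-" may be read as the start of a tag
raw_text <- read_html(raw_page) %>% html_text()

# Escaped: "&lt;" is decoded to a literal "<" in the text
escaped_text <- read_html(escaped_page) %>% html_text()

print(raw_text)
print(escaped_text)
```

If the escaped version keeps the full sentence, that supports the idea that the bare <- is being parsed as markup rather than as text.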
As the page you're after is a plain text file, you can use readr::read_file() in this instance.
readr::read_file("https://raw.githubusercontent.com/mjaniec2013/ExecutionTime/master/ExecutionTime.R")
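If readr is not available, base R's readLines() should work in a similar way, since it does no HTML parsing either. The sketch below demonstrates this on a temporary local file so it runs offline; the file contents are a made-up example, not the script from the URL.

```r
# Made-up R source containing the assignment operator
tmp <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "y <- x + 1"), tmp)

# readLines() returns one string per line and leaves "<-" untouched;
# paste() joins the lines back into a single string
script_text <- paste(readLines(tmp), collapse = "\n")

cat(script_text, "\n")
```

readLines() also accepts a URL in place of the file path, assuming your R build supports https connections.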