A friend of mine has written over 800 articles for a food blog, and I want to export all of them to PDF so that I can bind them nicely and gift them to him. There are simply too many articles to use Chrome's "Save as PDF" manually, so I am looking for the cleanest way to loop over the pages and save each one in this format. I have a working solution, but the final PDFs have ugly ads and cookie warning banners on every single page. I don't see these when I manually choose "Print" to PDF in Chrome. Is there a way to pass settings to Chromium through pagedown so that it prints without these elements? I've pasted my code below, with the website in question.
library(rvest)
library(dplyr)
library(stringr)
library(pagedown)

# Specifying the url for desired website to be scraped
url1 <- paste0('', '1', '/')
# Reading the HTML code from the website
webpage1 <- read_html(url1)
# Pull the links for all articles on George's initial author page
dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
  as_tibble() %>%
  filter(str_detect(value, "([0-9]{4})")) %>%
  unique()
# Pull the links for all articles on George's 2nd-89th author page
for (i in 2:89) {
  url <- paste0('', i, '/')
  # Reading the HTML code from the website
  webpage <- read_html(url)
  links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
    as_tibble() %>%
    filter(str_detect(value, "([0-9]{4})")) %>%
    unique()
  dat <- bind_rows(dat, links)
}
# The step after this pipe was lost in the post; renaming the column so that
# dat$link exists below is an assumption
dat <- dat %>%
  unique() %>%
  rename(link = value)
# form 1-link vector to test with
tocollect <- dat$link[1]
# The start of this call was lost in the post; chrome_print is assumed from the question text
pagedown::chrome_print(input = tocollect,
                       format = "pdf",
                       verbose = 0)
I would rather strip the page of all the elements you do not need (especially the scripts; the stylesheets you want to keep), save the result as a temporary HTML file, and then print that. The written HTML file looks nice in the browser, but I could not test the printing:
library(rvest)
library(xml2)

# articleUrls: character vector of article links (e.g. dat$link from the question)
for (i in seq_along(articleUrls)) {
  a <- read_html(articleUrls[i])
  # Remove layout elements, scripts, ads and banners; stylesheets are left in place
  xml_remove(a %>% xml_find_all("//aside"))
  xml_remove(a %>% xml_find_all("//footer"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-related mb20')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'tags')]"))
  xml_remove(a %>% xml_find_all("//script"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'ad box')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'newsletter-signup')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer-sidebar')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-footer')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'sticky-newsletter')]"))
  xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-header')]"))
  write_html(a, file = "currentArticle.html")
  # Write one PDF per article so later iterations do not overwrite earlier ones
  pagedown::chrome_print(input = "currentArticle.html",
                         output = sprintf("article_%03d.pdf", i))
}
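
If you want to check the selectors before running the full loop, the stripping step can be tried on a small inline page without touching the site or launching Chrome. This is only a minimal sketch: the helper name `strip_nodes` and the toy HTML are made up for illustration and are not taken from the blog, and the XPath list is just a subset of the one above.

```r
library(xml2)

# Hypothetical helper: remove every node matching any of the given XPath expressions.
# xml2 documents are modified in place, so the input document itself is changed.
strip_nodes <- function(doc, xpaths) {
  for (xp in xpaths) {
    xml_remove(xml_find_all(doc, xp))
  }
  invisible(doc)
}

# Toy page standing in for a real article (an assumption, not the blog's markup)
html <- '<html><head><link rel="stylesheet" href="style.css"/><script>var x = 1;</script></head>
<body><header class="site-header">Menu</header>
<article><h1>Title</h1><p>Text</p><div class="ad box">Ad</div></article>
<aside>Related</aside><footer class="site-footer">Footer</footer></body></html>'

doc <- read_html(html)
strip_nodes(doc, c("//script", "//aside", "//footer",
                   "//*[contains(@class, 'ad box')]",
                   "//*[contains(@class, 'site-header')]"))

# Inspect what is left: the stylesheet link and the article text should survive
print(length(xml_find_all(doc, "//script")))
print(length(xml_find_all(doc, "//link")))
print(xml_text(xml_find_all(doc, "//article//p")))
```

Keeping the XPath expressions in one character vector also makes it easy to add or drop a selector when a banner still shows up in the printed PDF.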