Web crawler in R: save poems as txt files

I would like to crawl the poems from this link and save them as txt files. Here are the requirements:

  1. create one folder per poet, named after the poet,
  2. save each poem in txt format by following the poem links (in the red circle) one by one,
  3. each file name should be the poem title with the .txt extension.

I'm new to web crawling with R. Could someone help? I'd appreciate any suggestions.



library(Rcrawler)

Rcrawler(Website = '', no_cores = 4, no_conn = 4, Obeyrobots = TRUE)

page <- LinkExtractor(url = '', ExternalLInks=TRUE)



  • This requires combining quite a few pieces of knowledge that I don't think a beginner can easily connect. So here is the code, with explanations in the comments:

    library(rvest)  # read_html(), html_nodes(), html_table(), html_attr(), html_text()
    library(dplyr)  # %>%, filter(), mutate()

    pg <- read_html("")
    tbl <- pg %>%
      html_nodes(xpath = "//table[@width='436']") %>% .[[2]] %>% # the table with the poems and poets is the second one whose width equals 436
      html_table(fill = TRUE) %>% # there are blank lines between the poems' rows => need fill = TRUE
      setNames(c("top", "poem", "poet")) %>%
      filter(!is.na(top)) %>% # remove blank lines (assumes blank rows come back as NA in the first column)
      mutate(
        # this is tricky: for each poem title, find the <a> tag whose text contains the title and extract its href attribute
        link = sapply(poem, function(x) {
          pg %>% html_node(xpath = paste0("//td/a[contains(., \"", x, "\")]")) %>% html_attr("href")
        }, USE.NAMES = FALSE)
      )
    dir <- "~/poems" # where you want to save the result
    # one folder per poet; recursive = TRUE also creates ~/poems itself if it is missing
    for (poet in unique(tbl$poet)) dir.create(file.path(dir, poet), recursive = TRUE, showWarnings = FALSE)
    for (i in seq_len(nrow(tbl))) {
      poem_content <-
        read_html(tbl$link[i]) %>% # read the linked page
        html_nodes(xpath = "//td/div[@style='padding-left:14px;padding-top:20px;font-family:Arial;font-size:13px;']/text()") %>%
        html_text(trim = TRUE) # poem lines
      file_path <- file.path(dir, tbl$poet[i], paste0(tbl$poem[i], ".txt"))
      writeLines(poem_content, con = file_path)
    }
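
One thing the saving step above does not handle: a poem title or poet name may contain characters that are not allowed in file names (for example `/` or `?`), in which case `writeLines()` would fail. Below is a minimal sketch of a workaround, assuming you are fine with replacing such characters by `_`. Note that `safe_name()` and `save_poem()` are hypothetical helpers written for illustration only (they are not part of rvest or Rcrawler), and the poet, title and lines in the usage part are placeholders, not data from the site.

```r
# Hypothetical helper: make a string usable as a file or folder name.
safe_name <- function(x) {
  x <- gsub("[/\\\\:*?\"<>|]", "_", x)  # replace characters that common file systems reject
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace into one space
  trimws(x)                             # drop leading/trailing whitespace
}

# Hypothetical helper: write one poem to <root>/<poet>/<title>.txt
# and return the path of the file that was written.
save_poem <- function(root, poet, title, lines) {
  poet_dir <- file.path(root, safe_name(poet))
  dir.create(poet_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(poet_dir, paste0(safe_name(title), ".txt"))
  writeLines(lines, con = path)
  invisible(path)
}

# Usage with placeholder data, written to a temporary directory
root <- file.path(tempdir(), "poems")
p <- save_poem(root, "Example Poet", "What/Why?", c("first line", "second line"))
```

Inside the loop you could then call `save_poem(dir, tbl$poet[i], tbl$poem[i], poem_content)` instead of building `file_path` by hand. The trade-off is that two titles differing only in a replaced character would end up in the same file.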