
RSelenium method for downloading data for a different year


The following code was written in one of my previous questions on the same platform. I need to download the data for 2020, but the URL does not change when 2020 is clicked in the right-hand column of the page. The code opens the 2020 page in Firefox but does not download the required 2020 files; it downloads the 2021 files instead, which I don't need. I am unable to figure out the issue. The URL I am working with is: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy The code is:

    library(tidyverse)
    library(stringr)
    library(purrr)
    library(rvest)
    library(RSelenium)

    rD <- rsDriver(browser="firefox", port=4567L, verbose=F)
    remDr <- rD[["client"]]

    remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook+of+Statistics+on+Indian+Economy")
    elem <- remDr$findElement(using = "link text", "2020")
    elem$clickElement()
    page <- remDr$getPageSource()[[1]]
    read_html(page) -> html
    html %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_subset("\\.PDF") -> urls
    urls %>% str_split(.,'/') %>% unlist() %>% str_subset("\\.PDF") -> filenames

    for(u in 1:length(urls))
    {
      cat(paste('downloading: ', u, ' of ', length(urls)))
      download.file(urls[u], filenames[u], mode='wb')
    }
    system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)

Solution

  • You can check whether you are on the right page by reading the Handbook title from the page source.

    Right after loading the page, you are on year 2021:

    remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook+of+Statistics+on+Indian+Economy")
    remDr$getPageSource()[[1]] %>% 
      read_html() %>% html_nodes(xpath = '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]') %>% html_text()
    [1] "Handbook of Statistics on the Indian Economy, 2020-21 "
    

    After clicking 2020, you are on year 2020:

    elem<- remDr$findElement(using = "link text", "2020")
    elem$clickElement()
    remDr$getPageSource()[[1]] %>% 
      read_html() %>% html_nodes(xpath = '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]') %>% html_text()
    [1] "Handbook of Statistics on Indian Economy 2019-20 "
    
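Because the URL stays the same when a year is clicked, one possible cause of getting the 2021 files is that the page source is read before the page has finished updating. This is an assumption, not something confirmed in the question. A minimal sketch of a polling helper follows; `wait_until` is a hypothetical name, not part of RSelenium, and the commented usage relies on the `remDr` session and the XPath shown above.

```r
# Poll `condition` (a zero-argument function) until it returns TRUE or
# `timeout` seconds have elapsed. Errors raised by `condition` count as
# "not ready yet". Returns TRUE on success and FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch(isTRUE(condition()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Usage with a live browser session (left commented out on purpose):
# heading_xpath <- '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]'
# current_heading <- function() {
#   remDr$getPageSource()[[1]] %>%
#     read_html() %>% html_nodes(xpath = heading_xpath) %>% html_text()
# }
# before <- current_heading()
# remDr$findElement(using = "link text", "2020")$clickElement()
# updated <- wait_until(function() !identical(current_heading(), before))
# if (!updated) stop("Heading did not change after clicking 2020")
```

Polling for a change in the heading avoids a fixed `Sys.sleep()` whose length would have to be guessed.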

    Edit:

    library(tidyverse)
    library(rvest)
    library(RSelenium)
    

    Launch the browser

    rD <- rsDriver(browser="firefox", port=4567L, verbose=F)
    remDr <- rD[["client"]]
    

    Load the webpage

    remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook+of+Statistics+on+Indian+Economy")
    

    Click the year 2020, then read the Handbook title to make sure you are getting the data for the right year.

    remDr$findElement(using = "link text", "2020")$clickElement()
    remDr$getPageSource()[[1]] %>% 
      read_html() %>% html_nodes(xpath = '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]') %>% html_text()
    
    [1] "Handbook of Statistics on the Indian Economy, 2020-21 "
    
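The title check can also be done in code rather than by eye. The sketch below pulls the edition label (for example "2019-20") out of the heading text and compares it with the edition you expect. `extract_edition` and `is_expected_edition` are hypothetical helper names, and the idea that the 2020 link corresponds to the "2019-20" edition is taken from the output shown earlier in this answer.

```r
# Return the first "YYYY-YY" label found in `heading`, or NA if none is found.
extract_edition <- function(heading) {
  m <- regmatches(heading, regexpr("[0-9]{4}-[0-9]{2}", heading))
  if (length(m) == 0) NA_character_ else m[[1]]
}

# TRUE only when the heading carries exactly the expected edition label.
is_expected_edition <- function(heading, expected) {
  identical(extract_edition(heading), expected)
}

# Example with the two headings that appear in this answer:
is_expected_edition("Handbook of Statistics on Indian Economy 2019-20 ", "2019-20")
is_expected_edition("Handbook of Statistics on the Indian Economy, 2020-21 ", "2019-20")
```

Stopping the script when the check fails prevents downloading files for the wrong year.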

    Get the PDF names and URLs, then start downloading

    urls =  remDr$getPageSource()[[1]] %>% read_html() %>% 
      html_nodes("a") %>%  
      html_attr("href") %>% 
      str_subset("\\.PDF")
    
    filenames = urls %>% str_split(.,'/') %>% unlist() %>% str_subset("\\.PDF")
    
    # seq_along() avoids looping over 1:0 when no PDF links were found
    for(u in seq_along(urls)){
      cat(paste('downloading: ', u, ' of ', length(urls), '\n'))
      download.file(urls[u], filenames[u], mode='wb')
    }
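
The download step can be made a little more defensive. The sketch below is one possible variant, not the method used above: it matches the `.pdf` extension at the end of the link in any letter case (the code above matches `.PDF` case-sensitively anywhere in the link), drops duplicates, and keeps going if a single file fails. `pdf_targets` and `download_all` are hypothetical names, and the links are assumed to be absolute URLs, as the code above already assumes.

```r
# Keep only links ending in .pdf (any case), drop NA values and duplicates,
# and derive the local file name from the last path component of each URL.
pdf_targets <- function(hrefs) {
  hrefs <- unique(hrefs[!is.na(hrefs)])
  urls <- hrefs[grepl("\\.pdf$", hrefs, ignore.case = TRUE)]
  data.frame(url = urls, file = basename(urls), stringsAsFactors = FALSE)
}

# Download every target into `dest_dir`, reporting failures without stopping.
download_all <- function(targets, dest_dir = ".") {
  for (i in seq_len(nrow(targets))) {
    cat(sprintf("downloading %d of %d: %s\n", i, nrow(targets), targets$file[i]))
    tryCatch(
      download.file(targets$url[i], file.path(dest_dir, targets$file[i]), mode = "wb"),
      error = function(e) message("failed: ", targets$url[i])
    )
  }
  invisible(targets)
}

# Usage with the session from this answer (needs a live browser):
# hrefs <- remDr$getPageSource()[[1]] %>% read_html() %>%
#   html_nodes("a") %>% html_attr("href")
# download_all(pdf_targets(hrefs))
```

Keeping each URL and its file name in the same data frame row means the two cannot drift out of step, which can happen when they are built as separate vectors.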