I wrote the following code for one of my previous questions on this platform. I need to download the data for 2020, but the URL does not change when 2020 is clicked in the right-hand column of the page. The code opens the 2020 page in Firefox but does not download the 2020 files. Instead it downloads the 2021 files, which I don't need. I am unable to figure out the issue. The URL I am working with is: https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook%20of%20Statistics%20on%20Indian%20Economy The code is:
library(tidyverse)
library(stringr)
library(purrr)
library(rvest)
library(RSelenium)
rD <- rsDriver(browser="firefox", port=4567L, verbose=F)
remDr <- rD[["client"]]
remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook+of+Statistics+on+Indian+Economy")
elem<- remDr$findElement(using = "link text", "2020")
elem$clickElement()
page <- remDr$getPageSource()[[1]]
read_html(page) -> html
html %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.PDF") -> urls
urls %>% str_split(.,'/') %>% unlist() %>% str_subset("\\.PDF") -> filenames
for(u in 1:length(urls))
{
cat(paste('downloading: ', u, ' of ', length(urls)))
download.file(urls[u], filenames[u], mode='wb')
}
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)
You can check whether you are on the right page by reading the name of the Handbook shown on it.
Right after loading the page, you are on year 2021:
remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook+of+Statistics+on+Indian+Economy")
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]') %>% html_text()
[1] "Handbook of Statistics on the Indian Economy, 2020-21 "
After clicking 2020, you are on year 2020:
elem<- remDr$findElement(using = "link text", "2020")
elem$clickElement()
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]') %>% html_text()
[1] "Handbook of Statistics on Indian Economy 2019-20 "
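The check above can also be run in a loop, so that the script only proceeds once the heading has actually changed. This is a minimal sketch that assumes the page source may sometimes be read before the click has finished updating the page (something the source does not confirm). `wait_until` is a hypothetical helper written for this example, not part of RSelenium.

```r
# Poll `check()` until it returns TRUE or `timeout` seconds have passed.
# `wait_until` is a hypothetical helper, not an RSelenium function.
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat errors (e.g. element not found yet) as "not ready"
    ok <- tryCatch(isTRUE(check()), error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Usage with the session above (commented out because it needs a live browser):
# heading_xpath <- '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]'
# remDr$findElement(using = "link text", "2020")$clickElement()
# loaded <- wait_until(function() {
#   heading <- remDr$getPageSource()[[1]] %>%
#     read_html() %>% html_nodes(xpath = heading_xpath) %>% html_text()
#   any(grepl("2019-20", heading, fixed = TRUE))
# })
# if (!loaded) stop("The 2020 page did not load in time")
```

If `wait_until()` returns `FALSE`, the heading never showed the expected year, so it is safer to stop than to download whatever links are on the page.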
library(tidyverse)
library(rvest)
library(RSelenium)
Launch the browser
rD <- rsDriver(browser="firefox", port=4567L, verbose=F)
remDr <- rD[["client"]]
Load the webpage
remDr$navigate("https://www.rbi.org.in/scripts/AnnualPublications.aspx?head=Handbook+of+Statistics+on+Indian+Economy")
Click the year 2020 and make sure you are getting the data for the right year.
remDr$findElement(using = "link text", "2020")$clickElement()
remDr$getPageSource()[[1]] %>%
read_html() %>% html_nodes(xpath = '//*[@id="accordion"]/table[2]/tbody/tr[2]/td[1]/text()[1]') %>% html_text()
[1] "Handbook of Statistics on Indian Economy 2019-20 "
Get the PDF names and URLs, then start downloading
urls = remDr$getPageSource()[[1]] %>% read_html() %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.PDF")
filenames = urls %>% str_split(.,'/') %>% unlist() %>% str_subset("\\.PDF")
for(u in seq_along(urls)){
cat(paste('downloading: ', u, ' of ', length(urls)), '\n')
download.file(urls[u], filenames[u], mode='wb')
}
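The link filtering and file naming can be tried without a browser. The sketch below uses made-up hrefs (not real RBI links) in place of the `html_attr("href")` output. It is an illustrative variant rather than the code above: `pdf_links` is a hypothetical helper, the match is case-insensitive, and `basename()` stands in for the `str_split()` step.

```r
# Hypothetical hrefs standing in for html_attr("href") output; not real RBI links.
hrefs <- c(
  "https://example.org/docs/0HB2020_TABLE1.PDF",
  "https://example.org/docs/table2.pdf",
  "https://example.org/scripts/page.aspx",
  NA
)

# Keep links that end in .pdf (any case) and drop anchors without an href.
pdf_links <- function(hrefs) {
  hrefs <- hrefs[!is.na(hrefs)]
  hrefs[grepl("\\.pdf$", hrefs, ignore.case = TRUE)]
}

urls <- pdf_links(hrefs)
filenames <- basename(urls)  # last path component, one name per URL

# seq_along() gives an empty loop when no links are found,
# whereas 1:length(urls) would iterate over 1 and 0.
for (i in seq_along(urls)) {
  cat("downloading:", i, "of", length(urls), "->", filenames[i], "\n")
  # download.file(urls[i], filenames[i], mode = "wb")  # uncomment to download
}
```

Printing `urls` before downloading is a cheap way to confirm that the links belong to the year you clicked.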