Web scraping a progress bar in R


I am scraping different projects from the following website: https://indiainvestmentgrid.gov.in/opportunities/nip-project/606803. There is a progress bar on this page that shows the project stage (ranging from "Under Conceptualisation" to "Completed"). Do you have any suggestions for how I can scrape it?

I am using RSelenium, extracting the page source and looking through it in the following way:

remDr$navigate('https://indiainvestmentgrid.gov.in/opportunities/nip-project/606803')
url <- read_html(remDr$getPageSource()[[1]])

project_title <- url %>% 
    html_nodes(".prj-name") %>%
    html_text()

However, I am not sure how to scrape this progress bar. SelectorGadget shows that the completed circles/bars carry the class ".active-stage", but I cannot find that class in the page source I extracted. For this project, the scraped value should be "Under Implementation".


Solution

  • It seems you are using both RSelenium and rvest; the code below reads the page with rvest alone. Also note that html_nodes() has been superseded by html_elements(). The colouring of the bars is (I think) determined by the value of the #projectStageId element. The following should work for most of those pages.

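    library(rvest)
    library(magrittr)
    
    url <- "https://indiainvestmentgrid.gov.in/opportunities/nip-project/606801"
    
    # Download and parse the static HTML of the project page
    out <- read_html(url)
    
    out %>%
      html_elements(css = "#projectStageId") %>%
      # Turn the matched node into its raw HTML string
      as.character() %>%
      # Cut the stage code out by character position (brittle if the markup changes)
      substr(start = 49, stop = nchar(.) - 2) %>%
      # Map the stage code to its label; unknown codes give NA
      switch(
        "500020" = "Under Conceptualization",
        "600037" = "Under Development",
        "500021" = "Under Implementation",
        "500023" = "Completed",
        NA
      )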
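
    The substr() offsets above depend on the exact length of the element's markup. A possibly sturdier variant is sketched below. It assumes that #projectStageId is an <input> element whose value attribute holds the stage code, which is what the offsets 49 and nchar(.) - 2 suggest but which is not confirmed here. The helper name get_project_stage and the stage_labels vector are made up for illustration, and the demo runs on a hand-written HTML fragment rather than the live page.

```r
library(rvest)

# Stage codes and labels copied from the switch() call above.
stage_labels <- c(
  "500020" = "Under Conceptualization",
  "600037" = "Under Development",
  "500021" = "Under Implementation",
  "500023" = "Completed"
)

# Hypothetical helper: takes a parsed page and returns the stage label,
# or NA when the element is missing or the code is unknown.
# Assumption: the stage code lives in the element's `value` attribute.
get_project_stage <- function(page) {
  code <- page %>%
    html_element("#projectStageId") %>%
    html_attr("value")
  unname(stage_labels[code])
}

# Offline demo on a hand-written fragment (not the real page source).
demo <- read_html(
  '<html><body><input type="hidden" id="projectStageId" value="500021"></body></html>'
)
get_project_stage(demo)
```

    On a live page the call would be get_project_stage(read_html(url)). Because a missing element or an unrecognised code yields NA instead of an error, the helper can be mapped over a vector of project URLs without stopping at the first odd page.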