r, web-scraping, rselenium

Web scraping a website whose URL does not change


I am very new to web scraping, and I am having some difficulty scraping this website's content. I would basically like to collect the pesticide name and active ingredient for each product, but the URL does not change as I navigate, and I could not find a way to click the grid entries. Any help?

library(RSelenium)
library(rvest)
library(tidyverse)

rD <- rsDriver(browser = "firefox", port = 4547L, verbose = FALSE)
remDr <- rD[["client"]]

remDr$navigate("http://www.cdms.net/Label-Database")

Solution

  • This site calls an API to get the list of manufacturers, so you can request the data directly with httr instead of driving a browser: http://www.cdms.net/labelssds/Home/ManList?Keys=

    The products page calls a second API endpoint that takes the manufacturer ID, for example: http://www.cdms.net/labelssds/Home/ProductList?manId=537

    You just need to loop through the Lst array in each response and append the results to a data frame. For instance, the following code gets all the products for the first 5 manufacturers:

    library(httr)
    
    manufacturers <- content(GET("http://www.cdms.net/labelssds/Home/ManList?Keys="), as = "parsed", type = "application/json")
    maxManufacturer <- 5
    
    index <- 1              # next free slot in the result list
    manufacturerCount <- 0  # manufacturers processed so far
    data <- list()          # one element per product
    
    for(m in manufacturers$Lst){
      print(m$label)
      productUrl <- modify_url("http://www.cdms.net/labelssds/Home/ProductList", 
        query = list(
          "manId" = m$value
        )
      )
      products <- content(GET(productUrl), as = "parsed", type = "application/json")
    
      for(p in products$Lst){
        data[[index]] = p
        index <- index + 1
      }
    
      manufacturerCount <- manufacturerCount + 1
      if (manufacturerCount == maxManufacturer){
        break
      }
      Sys.sleep(0.5) # pause between requests to avoid hammering the server
    }
    
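    # rbind on a list of lists returns a list-matrix (one row per product), not a true data frame
    df <- do.call(rbind, data)
    options(width = 1200) # widen console output so rows do not wrap
    print(df)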
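
Note that do.call(rbind, data) on a list of parsed JSON records gives a list-matrix, which is fine for printing but awkward to filter or write to CSV. If you want a regular data frame, something along these lines may help. It is only a sketch in base R: the helper name records_to_df and the sample records are made up for illustration, and the real field names depend on what the API returns. It assumes each record is a named list of scalar values, and it stores anything missing, null, or non-scalar as NA.

```r
# Convert a list of parsed JSON records (named lists) into a data frame.
# records_to_df is a hypothetical helper, not part of httr.
records_to_df <- function(records) {
  if (length(records) == 0) {
    return(data.frame())
  }
  # Union of all field names seen across the records
  fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(rec) {
    values <- lapply(fields, function(f) {
      v <- rec[[f]]
      # Keep scalar atomic values; missing, NULL, or nested values become NA
      if (is.atomic(v) && length(v) == 1) v else NA
    })
    names(values) <- fields
    as.data.frame(values, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Hypothetical records; the real field names come from the API response.
sample_records <- list(
  list(label = "Product A", value = 101L, note = NULL),
  list(label = "Product B", value = 102L, note = "example")
)
df <- records_to_df(sample_records)
str(df)
```

With the loop above you would call records_to_df(data) in place of do.call(rbind, data), then pick out whichever columns hold the product name and active ingredient.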