Tags: r, web-scraping, rvest, httr, rselenium

R / Rvest / RSelenium: scrape data from JS Sites


I am new to web scraping with R and rvest. With rvest you can scrape static HTML, but I have found that it struggles to scrape data from JavaScript-heavy sites.

I found some articles and blog posts, but they seem outdated, for example https://awesomeopensource.com/project/yusuzech/r-web-scraping-cheat-sheet

In my case I want to scrape odds from sports betting sites, but in my opinion this isn't possible with rvest and SelectorGadget because of the JavaScript.

There is an article from 2018 about scraping odds from PaddyPower (https://www.r-bloggers.com/how-to-scrape-data-from-a-javascript-website-with-r/), but it is outdated too, because PhantomJS isn't available anymore. RSelenium seems to be an option, but the repo has many issues: https://github.com/ropensci/RSelenium.

So is it possible to work with RSelenium in its current state or what options do I have instead of RSelenium?

kind regards


Solution

  • I've had no problems using RSelenium with the help of the wdman package, which allowed me to just not bother with Docker. wdman also fetches all binaries you need if they aren't already available. It's nice magic.
    Here's a simple script to spin up a Selenium instance with Chrome, open a site, get the contents as XML, and then close it all down again:

    library(wdman)
    library(RSelenium)
    library(xml2)
    
    # start a selenium server with wdman, running the latest chrome version
    selServ <- wdman::selenium(
      port = 4444L,
      version = 'latest',
      chromever = 'latest'
    )
    
    # start your chrome Driver on the selenium server
    remDr <- remoteDriver(
      remoteServerAddr = 'localhost',
      port = 4444L,
      browserName = 'chrome'
    )
    
    # open a selenium browser tab
    remDr$open()
    
    # navigate to your site (some_url is a placeholder for the URL you want to scrape)
    remDr$navigate(some_url)
    
    # get the html contents of that site as xml tree
    page_xml <- xml2::read_html(remDr$getPageSource()[[1]])
    
    # do your magic
    # ... check doc at `?remoteDriver` to see what your remDr object can help you do.
    
    # clean up after yourself
    remDr$close()
    selServ$stop()
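
For the "do your magic" step, here is a minimal sketch of how the parsed page could be turned into a data frame with xml2. It runs without Selenium: the `page_source` string stands in for `remDr$getPageSource()[[1]]`, and the markup, the class names (`event`, `team`, `odds`) and the values are made up for illustration, so the XPath expressions will need adapting to the real site.

```r
library(xml2)

# Stand-in for remDr$getPageSource()[[1]].
# The HTML structure and class names below are hypothetical.
page_source <- '
<html><body>
  <div class="event">
    <span class="team">Team A</span><span class="odds">1.85</span>
  </div>
  <div class="event">
    <span class="team">Team B</span><span class="odds">2.10</span>
  </div>
</body></html>'

page_xml <- read_html(page_source)

# One node per event, then pull the fields out of each node.
events <- xml_find_all(page_xml, "//div[@class='event']")

odds <- data.frame(
  team = xml_text(xml_find_first(events, ".//span[@class='team']")),
  odds = as.numeric(xml_text(xml_find_first(events, ".//span[@class='odds']"))),
  stringsAsFactors = FALSE
)

print(odds)
```

On a JS-heavy page the data may not be in the source yet right after `navigate()`. A short `Sys.sleep()`, or polling `remDr$findElements()` until it returns a non-empty list, is one possible workaround before calling `getPageSource()`.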