
Web scraping using rvest only partially works


I'm new to web scraping with rvest in R, and I'm trying to access the match names in the left column of this betting site using XPath. I know the names are under the tag, but I can't access them with the following code:

library(rvest)

html <- "https://www.supermatch.com.uy/live#/5009370062"
a <- read_html(html)
a %>% html_nodes(xpath = "//span") %>% html_text()

But I only get some of the text. I've read that this may be because the website dynamically pulls data from databases using JavaScript and jQuery. Do you know how I can access these match names? Thanks in advance.


Solution

  • Some generic notes about basic scraping strategies

    The following refers to Google Chrome and Chrome DevTools, but the same concepts apply to other browsers and their built-in developer tools too. One thing to remember about rvest is that it can only handle the response delivered for that specific request, i.e. it never sees content that is fetched / transformed / generated by JavaScript running on the client side.

    • Loading the page and inspecting elements to extract an XPath or CSS selector for rvest seems to be the most common approach. Though the static content behind that URL and the rendered page (with its elements in the inspector) can be quite different. To take some guesswork out of the process, it's better to start by checking what content rvest might actually receive - open the page source and skim through it, or just search for a term you are interested in. At the time of writing Viettel is playing, but they are not listed anywhere in the source (screenshot: searching the page source for "Viettel" returns no matches). Meaning there's no reason to expect that rvest would be able to extract that data.
    • You could also disable JavaScript for that particular site in your browser and check whether the piece of information you are after is still there. If not, it's not there for rvest either.
    • If you want to go a step further and/or suspect that rvest receives something different from your browser session (the target site checks request headers and delivers an anti-scraping notice when it doesn't like the user-agent, for example), you can always check the actual content rvest was able to retrieve: read_html(some_url) %>% as.character() to dump the whole response, read_html(some_url) %>% xml2::html_structure() to get the formatted structure of the page, or read_html(some_url) %>% xml2::write_html("temp.html") to save the page content and inspect it in an editor or browser.
    • Coming back to Supermatch & DevTools. The data in that left pane must be coming from somewhere. What usually works is a search on the network pane - open the network pane, clear its current content, refresh the page and make sure the page is fully loaded; then run a search (for "Viettel", for example; screenshot: the search highlights the matching request). And you'll have the URL from there. There are some IDs in that request (https://www.supermatch.com.uy/live_recargar_menu/32512079?_=1656070333214), and it's wise to assume those values might be tied to the current session or just short-lived. So sometimes it's worth trying what would happen if we just clean it up a bit, i.e. remove 32512079?_=1656070333214. In this case it happens to work.
    • While here it's just a fragment of HTML and it makes sense to parse it with rvest, in most cases you'll end up landing on JSON and the process turns into working with APIs. When that happens, it's time to switch from rvest to something more appropriate for JSON - jsonlite + httr, for example.
    • Sometimes plain rvest is not enough and you either want or need to work with the page as it would have been rendered in your JavaScript-enabled browser. For this there's RSelenium.
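To see concretely that rvest never runs client-side JavaScript (the first point in the notes above), here is a small self-contained sketch using an inline HTML string instead of the live site:

```r
library(rvest)

# Static HTML with one real <span>, plus a script that would inject a
# second <span> in a browser. rvest/xml2 only parse the markup; the
# script is never executed, so the injected span never exists here.
html <- '<html><body>
  <span>static match</span>
  <script>document.body.innerHTML += "<span>js match</span>";</script>
</body></html>'

page <- read_html(html)
page %>% html_elements(xpath = "//span") %>% html_text()
#> [1] "static match"
```

The same `//span` XPath that the question uses finds only the static span, which is exactly why the live site returns just "some of the text".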
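The URL clean-up described above can also be done programmatically. The regular expression below is just one way to strip the ID and the cache-buster, assuming they are always a trailing numeric path segment plus a `_=` query string; the follow-up request is commented out because the endpoint needs network access and may change or start requiring session data at any time:

```r
# Request captured in the DevTools network pane:
raw_url <- "https://www.supermatch.com.uy/live_recargar_menu/32512079?_=1656070333214"

# Drop the trailing ID and the "_=" cache-buster:
clean_url <- sub("/[0-9]+\\?_=[0-9]+$", "/", raw_url)
clean_url
#> [1] "https://www.supermatch.com.uy/live_recargar_menu/"

# The cleaned endpoint returns an HTML fragment, so the usual rvest
# verbs apply (the "span" selector here is an assumption, not a
# verified selector for the match names):
# library(rvest)
# read_html(clean_url) %>% html_elements("span") %>% html_text()
```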
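When the network pane turns up JSON instead of an HTML fragment, the rvest verbs no longer help. A minimal sketch of the jsonlite side, using a made-up payload (the field names and the second team name are hypothetical, not Supermatch's actual schema; a real response would be fetched with httr::GET() and parsed the same way):

```r
library(jsonlite)

# Hypothetical JSON payload, just to illustrate the switch from
# HTML parsing to fromJSON():
payload <- '{"matches":[{"home":"Viettel","away":"Team B"}]}'

# fromJSON() simplifies the array of objects into a data frame:
parsed <- fromJSON(payload)
parsed$matches$home
#> [1] "Viettel"
```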
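A minimal RSelenium sketch for that last option. It assumes a local Firefox install, that the default rsDriver() setup works on your machine, and reuses the broad //span XPath from the question; the fixed sleep is a crude stand-in for a proper wait condition:

```r
library(RSelenium)

# Start a Selenium-driven browser (downloads drivers on first run).
driver <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remote <- driver$client

remote$navigate("https://www.supermatch.com.uy/live")
Sys.sleep(5)  # crude wait for the client-side JavaScript to populate the page

# The DOM is now the rendered one, so the spans from the question exist:
spans <- remote$findElements(using = "xpath", "//span")
texts <- vapply(spans, function(el) el$getElementText()[[1]], character(1))

remote$close()
driver$server$stop()
```

This is much heavier than hitting the endpoint found in the network pane, so it is usually the fallback rather than the first choice.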