Tags: r, selenium-webdriver, web-scraping, rselenium

Scraping multiple tabs under the same page using RSelenium


I need to scrape multiple tabs within the same page: Summary, Attack, and the other tabs that fall under Player Statistics. Webpage: https://www.sofascore.com/southampton-wolverhampton/dsV

I tried to apply what I learned from this post to the problem: (Web scraping data inside a tab using Rselenium)

Could you help me figure out what I should do to get the correct results? I am learning web scraping to extract data for a university project.

Thank you for your help.

    library(RSelenium)
    library(netstat)  # provides free_port()

    rD <- rsDriver(browser = "firefox", port = free_port(), chromever = NULL)
    remDr <- rD[["client"]]
    remDr$open()
    url <- "https://www.sofascore.com/southampton-wolverhampton/dsV"

    remDr$navigate(url)

    remDr$findElement(using = "css", value = ".iiSsIo")

I targeted the .iiSsIo class because it contains all the tabs that I need.

But it gave me this error:

Selenium message:Unable to locate element: .iiSsIo
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-MOGN5AG', ip: '192.168.0.114', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '19.0.2'
Driver info: driver.version: unknown

Error:   Summary: NoSuchElement
     Detail: An element could not be located on the page using the given search parameters.
     class: org.openqa.selenium.NoSuchElementException
     Further Details: run errorDetails method

Solution

  • Sorry, I might not have given the best advice in the comments section of the other question :D

    Here's a way to loop through all the tabs. The key thing to remember is that this page is dynamic: some parts of its HTML only appear after other buttons have been clicked. My guess is that you couldn't find the class you were looking for because you didn't tell RSelenium to click the Player Statistics tab first. I also noticed that this site only displays the table when the window is in full screen mode, so that may have been the cause instead. I'm not sure which.

    Anyway, this worked for me:

    # load libraries
    library(RSelenium)
    library(rvest)
    library(magrittr)
    
    # define target url
    url <- "https://www.sofascore.com/southampton-wolverhampton/dsV"
    
    
    # start RSelenium ------------------------------------------------------------
    
    rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
    remDr <- rD[["client"]]
    
    # open the remote driver-------------------------------------------------------
    remDr$open()
    
    # Navigate to webpage -----------------------------------------------------
    remDr$navigate(url)
    

    After we navigate to the page, we have to click the Player Statistics tab:

    # click on the player statistics tab ------------------------------------
    remDr$findElement(using = "css",value = ".fircAT > div:nth-child(2)")$clickElement()
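
Because the tab content appears only after the click, it can help to poll for an element instead of assuming it is already there. The following is a minimal sketch, not part of RSelenium: `wait_for` is a hypothetical helper that takes any zero-argument function, and `fake_finder` is a stand-in so the example runs without a browser.

```r
# Poll `finder` until it returns a non-empty list or `timeout` seconds pass.
# `wait_for` is a hypothetical helper, not an RSelenium function.
wait_for <- function(finder, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    found <- finder()
    if (length(found) > 0) return(found)
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds waiting for element")
    }
    Sys.sleep(interval)
  }
}

# Stand-in finder for demonstration: returns nothing on the first two calls,
# then a single fake element on the third.
calls <- 0
fake_finder <- function() {
  calls <<- calls + 1
  if (calls < 3) list() else list("fake element")
}

result <- wait_for(fake_finder, timeout = 5, interval = 0.1)
print(calls)  # 3
```

With a live session, the finder could be something like `function() remDr$findElements(using = "css", value = ".iiSsIo")`. As far as I know, `findElements` returns an empty list when nothing matches rather than raising an error, which is what makes it suitable for polling.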
    

    Then, after that tab has loaded, we can pull the page's HTML. This page's HTML changes depending on user input, so clicking a button and then pulling the HTML gives a different result than pulling it without clicking first.

    
    # pull the webpage html
    # then read it
    page_html <- remDr$getPageSource()[[1]] %>% 
      read_html() 
    

    Here we use the rvest package to find the node you were looking for and count how many children it has:

    n_children <- page_html %>% html_node(".iiSsIo") %>% html_children()  %>% length()
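
To see what the child count looks like without a browser, here is a small sketch on made-up HTML. The `.tab-bar` class and the button labels are assumptions for illustration only; the real page's markup and class names differ.

```r
library(rvest)

# Made-up HTML standing in for the tab bar (not the real page's markup)
snippet <- '<div class="tab-bar">
  <button>Summary</button>
  <button>Attack</button>
  <button>Defence</button>
</div>'

# Find the container node, then list its element children
tab_bar <- html_node(read_html(snippet), ".tab-bar")
tabs <- html_children(tab_bar)

n_tabs <- length(tabs)
tab_labels <- html_text(tabs)

print(n_tabs)      # 3
print(tab_labels)  # "Summary" "Attack" "Defence"
```

Reading the labels alongside the count is optional, but it gives you names you can attach to each tab's results later.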
    

    Each child node has a standard format, so we can paste the node name together with the number like this:

    children <- paste0("button.sc-gswNZR:nth-child(",1:n_children,")")
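
One thing to watch with this pattern: if the count is ever 0, `1:0` evaluates to `c(1, 0)` and you get two bogus selectors. A small sketch of a safer variant follows; `nth_child_selectors` is a hypothetical helper name, and it uses `seq_len()`, which returns an empty vector for 0.

```r
# Hypothetical helper: build one ":nth-child(i)" selector per tab.
# seq_len(0) is empty, so a count of zero yields no selectors at all.
nth_child_selectors <- function(base, n) {
  sprintf("%s:nth-child(%d)", base, seq_len(n))
}

selectors <- nth_child_selectors("button.sc-gswNZR", 3)
print(selectors)
# "button.sc-gswNZR:nth-child(1)" "button.sc-gswNZR:nth-child(2)"
# "button.sc-gswNZR:nth-child(3)"

no_selectors <- nth_child_selectors("button.sc-gswNZR", 0)
print(length(no_selectors))  # 0
```

A loop over an empty vector simply does nothing, so the rest of the script fails quietly instead of clicking elements that do not exist.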
    

    Then we can loop through each element in the children vector and click on the corresponding tab:

    for (child in children){
      
      remDr$findElement(using = "css",value = child )$clickElement()
      
      # just to slow things down so we can see what's happening
      Sys.sleep(1)
    }
    
    

    You could add this to the loop to pull each table:

    # define empty list
    tables <- list()
    
    for (child in children){
      
      # click on each tab
      remDr$findElement(using = "css",value = child )$clickElement()
      
      # give the tab a moment to render before reading the page
      # (1 second is a guess; adjust as needed)
      Sys.sleep(1)
      
      # pull the html and extract every table on the current tab
      tab_tables <- remDr$getPageSource()[[1]] %>% 
        read_html() %>%
        html_table()
      
      # append this tab's tables to the list
      tables <- c(tables, tab_tables)
      
    }
    
    tables
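
Appending with `c()` gives one flat list, so it is not obvious afterwards which table came from which tab. A possible alternative, sketched below, keeps the results in a list named by tab. `collect_by_tab` and `fake_fetch` are hypothetical names, and the fetcher here is a stand-in that returns a dummy data frame so the example runs without a browser.

```r
# Hypothetical helper: call `fetch_tables(label)` once per tab label and
# keep the results in a list named by label.
collect_by_tab <- function(tab_labels, fetch_tables) {
  results <- lapply(tab_labels, fetch_tables)
  names(results) <- tab_labels
  results
}

# Stand-in fetcher returning a one-row data frame per tab
fake_fetch <- function(label) {
  data.frame(tab = label, rows_found = nchar(label))
}

tables_by_tab <- collect_by_tab(c("Summary", "Attack"), fake_fetch)

print(names(tables_by_tab))              # "Summary" "Attack"
print(tables_by_tab[["Attack"]]$rows_found)  # 6
```

In a live session, the fetcher would click the tab and return `html_table(read_html(remDr$getPageSource()[[1]]))`, so each list entry holds that tab's tables.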