html, r, xpath, web-scraping, href

Scraping an HTML table and its href links in R when the page has more than one table, plus other particularities


My question is essentially the same as the one asked here: Scraping html table and its href Links in R

But the solution provided there does not work in my case, or there is something I didn't understand. In my case, the webpage has more than one table, and I don't know how to target a specific table with that solution.

For example, on this webpage, https://en.wikipedia.org/wiki/UEFA_Champions_League, how would I focus on the table "All time top scorers"? How would I get the links for the columns "Player", "Country" and "Club(s)"?

I tried something like:

links = read_html("https://en.wikipedia.org/wiki/UEFA_Champions_League") %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[5]')%>% 
  html_nodes(xpath = '//td/a')%>% html_attr("href") 

But it keeps giving me other links.

Another difficulty is that some names are in bold and some are not.


Solution

  • You can download all the tables present in the page and select the one which you need.

    library(rvest)
    url <- 'https://en.wikipedia.org/wiki/UEFA_Champions_League'
    all_tables <- url %>%
                   read_html() %>%
                   html_nodes('table.wikitable') %>%
                   html_table(fill = TRUE)
    

    So in your case, you need

    all_tables[[4]]
    
    #                  Player     Country Goals Apps Ratio     Years ....
    #1  1   Cristiano Ronaldo    Portugal   128  168  0.76     2003– ....
    #2  2        Lionel Messi   Argentina   114  140  0.81     2005– ....
    #3  3                Raúl       Spain    71  142  0.50 1995–2011 ....
    #4  4       Karim Benzema      France    64  118  0.54     2006– ....
    #5  5  Robert Lewandowski      Poland    63   85  0.74     2011– ....
    #6  6 Ruud van Nistelrooy Netherlands    56   73  0.77 1998–2009 ....
    #7  7       Thierry Henry      France    50  112  0.45 1997–2010 ....
    #8  8  Alfredo Di Stéfano   Argentina    49   58  0.84 1955–1964 ....
    #9  9   Andriy Shevchenko     Ukraine    48  100  0.48 1994–2012 ....
    #10 9  Zlatan Ibrahimović      Sweden    48  120  0.40 2001–2017 ....
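
Picking the table by position (`[[4]]`) can break if the article gains or loses a table. As a rough sketch, one could instead pick the table by its column headers. The inline HTML below is a made-up stand-in for the page, and `wanted` is an assumed set of header names, so both would need adjusting for the real article.

```r
library(rvest)

# Hypothetical miniature page with two tables (not the real Wikipedia markup).
page <- read_html('<html><body>
  <table class="wikitable">
    <tr><th>Season</th><th>Winner</th></tr>
    <tr><td>2000</td><td>Club A</td></tr>
  </table>
  <table class="wikitable">
    <tr><th>Player</th><th>Country</th><th>Goals</th></tr>
    <tr><td>Player X</td><td>Country Y</td><td>10</td></tr>
  </table>
</body></html>')

tables <- html_nodes(page, "table.wikitable")

# Assumed header names that identify the table we want.
wanted <- c("Player", "Country", "Goals")

# For each table, check whether its header cells include every wanted name.
has_cols <- vapply(seq_along(tables), function(i) {
  headers <- trimws(html_text(html_nodes(tables[[i]], "th")))
  all(wanted %in% headers)
}, logical(1))

# Take the first matching table and convert it to a data frame.
scorers_node <- tables[[which(has_cols)[1]]]
scorers <- html_table(scorers_node)
```

Keeping `scorers_node` around is useful because the links live in the node, not in the data frame returned by `html_table()`.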
    

    You might also be interested in the WikipediR package, which helps to retrieve data from Wikipedia.


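    To get the href links from that table, we can do:

    url %>%
      read_html() %>%
      html_nodes('table.wikitable') %>%
      .[[4]] %>%                  # keep only the fourth wikitable
      html_nodes('a') %>%         # every link inside that table
      html_attr('href') %>%
      paste0('https://en.wikipedia.org', .)  # hrefs are site-relative (/wiki/...)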
    
    #[1] "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"       
    #[2] "https://en.wikipedia.org/wiki/Portugal"                
    #[3] "https://en.wikipedia.org/wiki/Manchester_United_F.C." 
    #....
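
Two follow-up points, shown as a hedged sketch on made-up HTML. First, the attempt in the question returns other links because an XPath that starts with `//` is evaluated from the document root even when it is applied to a single table; starting it with `.//` keeps the search inside that node. Second, to get the links per column, one can walk the rows and look at the n-th cell. Because `//a` matches at any depth, names wrapped in `<b>` are found as well. The helper `column_links()` and the sample markup are hypothetical, and the column positions assume no `rowspan`/`colspan` cells, which the real table may have.

```r
library(rvest)
library(xml2)

# Page URL used as the base for resolving relative hrefs.
base_url <- "https://en.wikipedia.org/wiki/UEFA_Champions_League"

# Hypothetical stand-in for the article: the table of interest plus another table.
page <- read_html('<html><body>
  <table class="wikitable">
    <tr><th>Player</th><th>Country</th><th>Club(s)</th></tr>
    <tr><td><b><a href="/wiki/Player_X">Player X</a></b></td>
        <td><a href="/wiki/Country_Y">Country Y</a></td>
        <td><a href="/wiki/Club_A">Club A</a>, <a href="/wiki/Club_B">Club B</a></td></tr>
    <tr><td><a href="/wiki/Player_Z">Player Z</a><sup><a href="#cite_note-1">[1]</a></sup></td>
        <td><a href="/wiki/Country_W">Country W</a></td>
        <td><a href="/wiki/Club_C">Club C</a></td></tr>
  </table>
  <table class="wikitable"><tr><td><a href="/wiki/Other">Other</a></td></tr></table>
</body></html>')

tbl <- html_nodes(page, "table.wikitable")[[1]]

# "//td/a" restarts at the document root, so it also picks up the second table.
n_absolute <- length(html_nodes(tbl, xpath = "//td/a"))
# ".//td/a" is relative to `tbl` and stays inside it.
n_relative <- length(html_nodes(tbl, xpath = ".//td/a"))

# Hypothetical helper: links found in the n-th <td> of every data row.
column_links <- function(tbl, n) {
  rows <- html_nodes(tbl, xpath = ".//tr[td]")  # skip header-only rows
  lapply(seq_along(rows), function(i) {
    cell_links <- html_nodes(rows[[i]], xpath = sprintf("./td[%d]//a", n))
    hrefs <- html_attr(cell_links, "href")
    # Drop missing hrefs and in-page anchors such as footnote markers.
    hrefs <- hrefs[!is.na(hrefs) & !startsWith(hrefs, "#")]
    url_absolute(hrefs, base_url)
  })
}

player_links  <- column_links(tbl, 1)
country_links <- column_links(tbl, 2)
club_links    <- column_links(tbl, 3)
```

Each result is a list with one element per row, so a player with several clubs keeps all of their club links together instead of being flattened into one long vector.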