My question is essentially the same as the one asked here: Scraping html table and its href Links in R
However, the solution provided there does not work in my case, or perhaps I have misunderstood something. The webpage I am scraping contains more than one table, and I don't know how to target a specific table with that solution.
For example, on this webpage https://en.wikipedia.org/wiki/UEFA_Champions_League, how would I focus on the table "All time top scorers"? How would I get the links for the columns "Player", "Country" and "Club(s)"?
I tried something like
links = read_html("https://en.wikipedia.org/wiki/UEFA_Champions_League") %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[5]')%>%
html_nodes(xpath = '//td/a')%>% html_attr("href")
But it keeps returning links from other parts of the page.
Another difficulty is that some names in the table are in bold and some are not.
You can extract all the tables present on the page and then select the one you need.
library(rvest)
url <- 'https://en.wikipedia.org/wiki/UEFA_Champions_League'
all_tables <- url %>%
  read_html() %>%
  html_nodes('table.wikitable') %>%
  html_table(fill = TRUE)
So in your case, you need
all_tables[[4]]
# Player Country Goals Apps Ratio Years ....
#1 1 Cristiano Ronaldo Portugal 128 168 0.76 2003– ....
#2 2 Lionel Messi Argentina 114 140 0.81 2005– ....
#3 3 Raúl Spain 71 142 0.50 1995–2011 ....
#4 4 Karim Benzema France 64 118 0.54 2006– ....
#5 5 Robert Lewandowski Poland 63 85 0.74 2011– ....
#6 6 Ruud van Nistelrooy Netherlands 56 73 0.77 1998–2009 ....
#7 7 Thierry Henry France 50 112 0.45 1997–2010 ....
#8 8 Alfredo Di Stéfano Argentina 49 58 0.84 1955–1964 ....
#9 9 Andriy Shevchenko Ukraine 48 100 0.48 1994–2012 ....
#10 9 Zlatan Ibrahimović Sweden 48 120 0.40 2001–2017 ....
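Note that a positional index such as [[4]] depends on the page's current layout and can shift when the article is edited. As a hedged alternative, the table can be picked by its header cells instead. The sketch below runs on a small hypothetical HTML snippet (not the real article), and the column names in `wanted` are an assumption about what identifies the table you are after:

```r
library(rvest)

# Hypothetical two-table page standing in for the real article
page <- read_html('
<html><body>
  <table class="wikitable"><tr><th>Club</th><th>Titles</th></tr>
    <tr><td>X</td><td>1</td></tr></table>
  <table class="wikitable"><tr><th>Player</th><th>Goals</th></tr>
    <tr><td>Y</td><td>2</td></tr></table>
</body></html>')

tables <- html_nodes(page, 'table.wikitable')

# Flag the tables whose header cells include every wanted column name
wanted <- c('Player', 'Goals')
has_cols <- vapply(tables, function(tbl) {
  headers <- tbl %>% html_nodes('th') %>% html_text(trim = TRUE)
  all(wanted %in% headers)
}, logical(1))

# Take the first matching table and convert it to a data frame
scorers <- html_table(tables[has_cols][[1]])
```

On the real page, more than one table may share the same headers, so it is worth checking `sum(has_cols)` before taking the first match.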
You might also be interested in the WikipediR package, which helps to retrieve data from Wikipedia.
To get the href links from that table, we can do:
url %>%
  read_html() %>%
  html_nodes('table.wikitable') %>%
  .[[4]] %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  paste0('https://en.wikipedia.org', .)
#[1] "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
#[2] "https://en.wikipedia.org/wiki/Portugal"
#[3] "https://en.wikipedia.org/wiki/Manchester_United_F.C."
#....
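As for why the attempt in the question returned other links: an XPath that starts with `//` is evaluated from the document root, even when it is applied to an already selected node, so `//td/a` searches every table on the page. Prefixing it with a dot (`.//`) keeps the search below the current node. In addition, `td/a` only matches links that are direct children of a cell, so links wrapped in bold are missed, whereas `td//a` matches them at any depth. A minimal sketch on a hypothetical HTML snippet (the ids and hrefs are made up for illustration):

```r
library(rvest)

# Hypothetical two-table page, used only to illustrate the XPath behaviour
page <- read_html('
<html><body>
  <table id="first"><tr><td><a href="/wiki/A">A</a></td></tr></table>
  <table id="second"><tr><td><b><a href="/wiki/B">B</a></b></td></tr></table>
</body></html>')

second <- html_nodes(page, xpath = '//table[@id="second"]')

# Absolute path: "//" restarts from the document root, so this picks up
# the link from the first table and misses the bold link in the second
absolute <- second %>% html_nodes(xpath = '//td/a') %>% html_attr('href')

# Relative path: ".//" searches only below the selected table, and
# "td//a" also matches links nested inside <b>
relative <- second %>% html_nodes(xpath = './/td//a') %>% html_attr('href')
```

Here `absolute` holds the link from the wrong table, while `relative` holds only the link from the selected one. The CSS selector `html_nodes('a')` used above is always relative to the current node, which is why it does not have this problem.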