I'm trying to scrape the page https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads and can extract the text data fine using rvest:
library(plyr)
library(XML)
library(rvest)
library(dplyr)
library(magrittr)
library(data.table)
# Download and parse the page once, rather than on every iteration
html <- read_html("https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads")
for (i in 1:16) {
  # Store the i-th table as squad1, squad2, ..., squad16
  float <- paste("squad", i, sep = "")
  print(float)
  assign(float, html_table(html_nodes(html, "table")[[i]]))
}
but I would also like to add an extra column to each table containing the URL of each player's club. For example, for squad 1 (the Polish squad on the page, truncated here to the first six players):
   0#0  Pos.  Player               Date of birth (age)                     Caps  Goals  Club
1  1    1GK   Wojciech Szczęsny    (1990-04-18)18 April 1990 (aged 22)     11    0      Arsenal
2  2    2DF   Sebastian Boenisch   (1987-02-01)1 February 1987 (aged 25)   9     0      Werder Bremen
3  3    2DF   Grzegorz Wojtkowiak  (1984-01-26)26 January 1984 (aged 28)   19    0      Lech Poznań
4  4    2DF   Marcin Kamiński      (1992-01-15)15 January 1992 (aged 20)   3     0      Lech Poznań
5  5    3MF   Dariusz Dudka        (1983-12-09)9 December 1983 (aged 28)   65    2      Auxerre
6  6    3MF   Adam Matuszczyk      (1989-02-14)14 February 1989 (aged 23)  20    1      Fortuna Düsseldorf
I would like a column after "Club" called "clubURL" that holds the Wikipedia URL for that club. For instance, the first player plays for Arsenal, so I want to take the Arsenal link from the table and produce:
   0#0  Pos.  Player             Date of birth (age)                  Caps  Goals  Club
1  1    1GK   Wojciech Szczęsny  (1990-04-18)18 April 1990 (aged 22)  11    0      Arsenal
   clubURL
1  https://en.wikipedia.org/wiki/Arsenal_F.C.
and so on for the remaining rows. I found the question "rvest table scraping including links" but couldn't get that example to work, nor adapt it to what I want to do. Sorry if this has been asked elsewhere.
Thanks,
I made an example using the first table on the page. You can extend this as needed.
First, grab the first table node and convert it to a data frame using html_table. Then define a helper function that extracts a link from the table node, given the link text. Finally, use sapply to populate a new column in the data frame.
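library("rvest")
url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"
# Keep the raw table node so its links can still be looked up afterwards
mytable <- read_html(url) %>% html_nodes("table") %>% .[[1]]
df <- mytable %>% html_table()
# Return the href of the first link in the table whose text equals `team`
get_link <- function(html_table, team){
  html_table %>%
    # ".//" searches inside this table only; a bare "//" would search the whole page
    html_nodes(xpath=paste0(".//a[text()='", team, "']")) %>%
    .[[1]] %>%
    html_attr("href")
}
df$club_link <- sapply(df$Club, function(x)get_link(mytable, x))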
> head(df)
   0#0  Pos.  Player
1  1    1GK   Wojciech Szczęsny
2  2    2DF   Sebastian Boenisch
3  3    2DF   Grzegorz Wojtkowiak
4  4    2DF   Marcin Kamiński
5  5    3MF   Dariusz Dudka
6  6    3MF   Adam Matuszczyk
   Date of birth (age)                     Caps  Goals
1  (1990-04-18)18 April 1990 (aged 22)     11    0
2  (1987-02-01)1 February 1987 (aged 25)   9     0
3  (1984-01-26)26 January 1984 (aged 28)   19    0
4  (1992-01-15)15 January 1992 (aged 20)   3     0
5  (1983-12-09)9 December 1983 (aged 28)   65    2
6  (1989-02-14)14 February 1989 (aged 23)  20    1
   Club                club_link
1  Arsenal             /wiki/Arsenal_F.C.
2  Werder Bremen       /wiki/SV_Werder_Bremen
3  Lech Poznań         /wiki/Lech_Pozna%C5%84
4  Lech Poznań         /wiki/Lech_Pozna%C5%84
5  Auxerre             /wiki/AJ_Auxerre
6  Fortuna Düsseldorf  /wiki/Fortuna_D%C3%BCsseldorf
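The links above are relative to the site root, while the question asked for full URLs in a column named clubURL, for every squad table. Here is a rough sketch of one way to extend the example. The helper name add_club_urls and the small inline HTML snippet are made up for illustration (the snippet stands in for the real page so the code runs offline), and I am assuming xml2::url_absolute() is an acceptable way to resolve the relative links.

```r
library(rvest)
library(xml2)

# Hypothetical helper (not part of the answer above): converts one table node
# to a data frame and adds a clubURL column holding absolute URLs.
add_club_urls <- function(table_node, base_url) {
  df <- html_table(table_node)
  df$clubURL <- vapply(df$Club, function(club) {
    # ".//" restricts the search to links inside this table.
    # Limitation: a club name containing a single quote would break this XPath.
    links <- html_nodes(table_node, xpath = paste0(".//a[text()='", club, "']"))
    if (length(links) == 0) return(NA_character_)  # no matching link found
    # Resolve the relative href against the page URL
    url_absolute(html_attr(links[[1]], "href"), base_url)
  }, character(1), USE.NAMES = FALSE)
  df
}

base_url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"

# Toy stand-in for the real page, using two rows from the output above
page <- read_html('<html><body><table>
  <tr><th>Player</th><th>Club</th></tr>
  <tr><td>Sebastian Boenisch</td>
      <td><a href="/wiki/SV_Werder_Bremen">Werder Bremen</a></td></tr>
  <tr><td>Dariusz Dudka</td>
      <td><a href="/wiki/AJ_Auxerre">Auxerre</a></td></tr>
</table></body></html>')

# One data frame per table, each with its own clubURL column
squads <- lapply(html_nodes(page, "table"), add_club_urls, base_url = base_url)
print(squads[[1]]$clubURL)
```

For the live page you would presumably replace the toy input with read_html(base_url) and keep only the first 16 tables, as in the question, e.g. html_nodes(read_html(base_url), "table")[1:16]. Returning a list of data frames avoids creating squad1 to squad16 with assign, but that is a design preference rather than a requirement.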