Scrape URLs from a Wikipedia table


I'm trying to scrape the page https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads and can extract the text data fine using rvest:

library(rvest)  # the only package this snippet needs

# Download and parse the page once, outside the loop
html <- read_html("https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads")
tables <- html_nodes(html, "table")

for (i in 1:16) {
  squad_name <- paste0("squad", i)
  print(squad_name)
  # Create a data frame named squad1, squad2, ..., squad16
  assign(squad_name, html_table(tables[[i]]))
}

I would also like to add an extra column to each table holding the URL of each player's club. For example, here is squad 1 (the Polish squad on the page, truncated to the first 6 players):

     0#0 Pos.              Player                    Date of birth (age) Caps Goals                Club
1   1  1GK  Wojciech Szczęsny    (1990-04-18)18 April 1990 (aged 22)   11     0             Arsenal
2   2  2DF  Sebastian Boenisch  (1987-02-01)1 February 1987 (aged 25)    9     0       Werder Bremen
3   3  2DF Grzegorz Wojtkowiak  (1984-01-26)26 January 1984 (aged 28)   19     0        Lech Poznań
4   4  2DF    Marcin Kamiński  (1992-01-15)15 January 1992 (aged 20)    3     0        Lech Poznań
5   5  3MF       Dariusz Dudka  (1983-12-09)9 December 1983 (aged 28)   65     2             Auxerre
6   6  3MF     Adam Matuszczyk (1989-02-14)14 February 1989 (aged 23)   20     1 Fortuna Düsseldorf

I would like a "clubURL" column after "Club" holding the Wikipedia URL for that club. For instance, the first player plays for Arsenal, so I want to take the Arsenal link from the table and produce:

0#0 Pos.             Player                 Date of birth (age) Caps Goals    Club
1   1  1GK Wojciech Szczęsny (1990-04-18)18 April 1990 (aged 22)   11     0 Arsenal
                                     clubURL
1 https://en.wikipedia.org/wiki/Arsenal_F.C.

and so on for every row. I found the question "rvest table scraping including links" but couldn't get that example to work, nor adapt it to what I want to do. Sorry if this has been asked elsewhere.

Thanks,


Solution

  • I made an example using the first table on the page. You can extend this as needed.

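    First, I grab the first table node and convert it to a data frame with html_table. Then a helper function looks up the link for a given link text and returns its href. Finally, sapply applies the helper to every value in the Club column to populate a new column. The href values come back as site-relative paths (/wiki/...), so prepend https://en.wikipedia.org if you need full URLs.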

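    library("rvest")
    url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"
    # Keep the table node itself: html_table() drops the links
    mytable <- read_html(url) %>% html_nodes("table") %>% .[[1]]
    df <- mytable %>% html_table()

    # Return the href of the first link in table_node whose text equals team.
    # The leading "." keeps the XPath search inside table_node; a bare "//a"
    # would search the whole page.
    get_link <- function(table_node, team){
      table_node %>%
        html_nodes(xpath=paste0(".//a[text()='", team, "']")) %>%
        .[[1]] %>%
        html_attr("href")
    }

    # Note: a club name containing an apostrophe would break the XPath string
    df$club_link <- sapply(df$Club, function(x) get_link(mytable, x), USE.NAMES = FALSE)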
    > head(df)
      0#0 Pos.              Player
    1   1  1GK  Wojciech Szczęsny
    2   2  2DF  Sebastian Boenisch
    3   3  2DF Grzegorz Wojtkowiak
    4   4  2DF    Marcin Kamiński
    5   5  3MF       Dariusz Dudka
    6   6  3MF     Adam Matuszczyk
                         Date of birth (age) Caps Goals
    1    (1990-04-18)18 April 1990 (aged 22)   11     0
    2  (1987-02-01)1 February 1987 (aged 25)    9     0
    3  (1984-01-26)26 January 1984 (aged 28)   19     0
    4  (1992-01-15)15 January 1992 (aged 20)    3     0
    5  (1983-12-09)9 December 1983 (aged 28)   65     2
    6 (1989-02-14)14 February 1989 (aged 23)   20     1
                     Club                     club_link
    1             Arsenal            /wiki/Arsenal_F.C.
    2       Werder Bremen        /wiki/SV_Werder_Bremen
    3        Lech Poznań        /wiki/Lech_Pozna%C5%84
    4        Lech Poznań        /wiki/Lech_Pozna%C5%84
    5             Auxerre              /wiki/AJ_Auxerre
    6 Fortuna Düsseldorf /wiki/Fortuna_D%C3%BCsseldorf
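
To extend this to every table and get full URLs as asked in the question, something along these lines might work. It is only a sketch: the helper name table_with_links and the tiny inline page are made up for illustration, and it assumes each data row has its club link as the first link in the last cell, which the real Wikipedia markup may not always satisfy. Instead of matching on link text, it reads the link row by row, so repeated or unusual club names are not a problem, and it uses xml2::url_absolute to turn the relative hrefs into absolute ones.

```r
library(rvest)  # attaches the %>% pipe; xml2 is installed as a dependency

# Hypothetical stand-in for the squads page: two tiny tables whose last
# column holds the club link. The real Wikipedia markup is more complex.
page <- read_html('
<html><body>
<table>
  <tr><th>Player</th><th>Club</th></tr>
  <tr><td><a href="/wiki/Dariusz_Dudka">Dariusz Dudka</a></td>
      <td><a href="/wiki/AJ_Auxerre">Auxerre</a></td></tr>
  <tr><td>Unlinked Player</td><td>Unlinked Club</td></tr>
</table>
<table>
  <tr><th>Player</th><th>Club</th></tr>
  <tr><td><a href="/wiki/Some_Player">Some Player</a></td>
      <td><a href="/wiki/Arsenal_F.C.">Arsenal</a></td></tr>
</table>
</body></html>')

base_url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"

# Hypothetical helper: convert one table node to a data frame and add a
# clubURL column taken from the first link in the last cell of each row.
table_with_links <- function(table_node, base_url) {
  df <- html_table(table_node)
  # Data rows only: the header row has <th> cells and no <td>
  rows <- html_nodes(table_node, xpath = ".//tr[td]")
  # Assumes one data frame row per <tr> (no rowspans or extra header rows)
  stopifnot(nrow(df) == length(rows))
  # One href per row; NA where the last cell contains no link
  href <- rows %>% html_node(xpath = "./td[last()]//a") %>% html_attr("href")
  # Resolve only the non-missing hrefs against the page URL
  has_link <- !is.na(href)
  if (any(has_link)) {
    href[has_link] <- xml2::url_absolute(href[has_link], base_url)
  }
  df$clubURL <- href
  df
}

# Apply the helper to every table and keep the results in a named list
squads <- lapply(html_nodes(page, "table"), table_with_links, base_url = base_url)
names(squads) <- paste0("squad", seq_along(squads))
```

On the real page you would replace the inline page with read_html(base_url) and keep only the first 16 tables, as in the question. Each squad is then available as squads$squad1, squads$squad2 and so on, which avoids creating sixteen separate variables with assign.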