
How to scrape a table from a website when its class isn't a table


I want to scrape the player data table from the following URL:

https://www.transfermarkt.de/mamadou-doucoure/profil/spieler/340480

Here's what I coded:

x <- read_html(url) %>%
        html_node(xpath = '//div[@class="row collapse"]') %>%
        html_table(fill = TRUE) %>% 
        as.data.frame() %>%
        set_names(.,letters[1:ncol(.)])

As far as I understand, the player data isn't classed as a table, and I don't know how to edit the code. Also, I want to have the output in a data frame.


Solution

  • A data frame can take many forms, and keeping the player table in a data frame as-is might not be the most practical choice, but here are a few examples. Some parts are a bit tricky, and handling them correctly depends on context and objective (e.g. multiple nationalities, which currently end up as a single collapsed value).

    library(rvest)
    library(dplyr, warn.conflicts = FALSE)
    library(tidyr)
    library(stringr)
    
    url <- "https://www.transfermarkt.de/mamadou-doucoure/profil/spieler/340480"
    html <- read_html(url)
    
    # most basic approach to extract just what's in the table + player name:
    df_01 <- tibble(
      feature = html_elements(html, "div.info-table > span.info-table__content--regular") %>% html_text() %>% str_squish(),
      text = html_elements(html, "div.info-table > span.info-table__content--bold") %>% html_text() %>% str_squish()
    ) %>%
      # player name is not included in div.info-table, add it separately
      add_row(.before = 1,
                  feature = "Player:",
                  text = html_elements(html, "header > div.data-header__headline-container > h1") %>% html_text() %>% str_squish())
    
    df_01
    #> # A tibble: 15 × 2
    #>    feature              text                                   
    #>    <chr>                <chr>                                  
    #>  1 Player:              "#4 Mamadou Doucouré"                  
    #>  2 Geburtsdatum:        "21.05.1998"                           
    #>  3 Geburtsort:          "Dakar"                                
    #>  4 Alter:               "24"                                   
    #>  5 Größe:               "1,83 m"                               
    #>  6 Nationalität:        "Frankreich Senegal"                   
    #>  7 Position:            "Abwehr - Innenverteidiger"            
    #>  8 Fuß:                 "links"                                
    #>  9 Spielerberater:      "Sport Avenir Management International"
    #> 10 Aktueller Verein:    "Borussia Mönchengladbach"             
    #> 11 Im Team seit:        "01.07.2016"                           
    #> 12 Vertrag bis:         "30.06.2024"                           
    #> 13 Letzte Verlängerung: "14.02.2020"                           
    #> 14 2. Verein:           "Borussia Mönchengladbach II (#3)"     
    #> 15 Social Media:        ""
    

    To include URLs, the first info-table column is handled as before, but the second one is processed element by element with purrr::map_df(). Not all entries have URLs, and extracting them per element keeps the columns aligned and of equal length:

    df_02 <- tibble(
      feature = html_elements(html, "div.info-table > span.info-table__content--regular") %>% html_text() %>% str_squish(),
    ) %>% bind_cols(
      purrr::map_df(
        html_elements(html, "div.info-table > span.info-table__content--bold"), 
        ~ list(
          html_text(.x) %>% stringr::str_squish() %>% na_if(""),
          html_element(.x, "a") %>% html_attr("href") 
        ) %>% setNames(c("text", "url"))
      )
    ) %>% add_row(.before = 1,
                feature = "Player:",
                text = html_elements(html, "header > div.data-header__headline-container > h1") %>% html_text() %>% stringr::str_squish())
    
    df_02
    #> # A tibble: 15 × 3
    #>    feature              text                                  url               
    #>    <chr>                <chr>                                 <chr>             
    #>  1 Player:              #4 Mamadou Doucouré                   <NA>              
    #>  2 Geburtsdatum:        21.05.1998                            /aktuell/waspassi…
    #>  3 Geburtsort:          Dakar                                 <NA>              
    #>  4 Alter:               24                                    <NA>              
    #>  5 Größe:               1,83 m                                <NA>              
    #>  6 Nationalität:        Frankreich Senegal                    <NA>              
    #>  7 Position:            Abwehr - Innenverteidiger             <NA>              
    #>  8 Fuß:                 links                                 <NA>              
    #>  9 Spielerberater:      Sport Avenir Management International /sport-avenir-man…
    #> 10 Aktueller Verein:    Borussia Mönchengladbach              /borussia-monchen…
    #> 11 Im Team seit:        01.07.2016                            <NA>              
    #> 12 Vertrag bis:         30.06.2024                            <NA>              
    #> 13 Letzte Verlängerung: 14.02.2020                            <NA>              
    #> 14 2. Verein:           Borussia Mönchengladbach II (#3)      /borussia-monchen…
    #> 15 Social Media:        <NA>                                  http://www.instag…
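
    The alignment point can be checked in isolation. The following is a minimal sketch on made-up markup (the span.value class and the hrefs are hypothetical, not taken from the Transfermarkt page). It compares html_elements(), which only returns the links that exist, with html_element(), which returns one result per input node:

    ```r
    library(rvest)

    # Hypothetical markup: three value spans, the middle one has no link
    page <- minimal_html('
      <div class="info-table">
        <span class="value"><a href="/a">first</a></span>
        <span class="value">second</span>
        <span class="value"><a href="/c">third</a></span>
      </div>')

    values <- html_elements(page, "span.value")

    # html_elements() flattens all matches: only 2 links for 3 spans,
    # so this vector can no longer be paired with the spans row by row
    links_flat <- values %>% html_elements("a") %>% html_attr("href")
    length(links_flat)

    # html_element() keeps one entry per span and yields NA where no link exists
    links_aligned <- values %>% html_element("a") %>% html_attr("href")
    links_aligned
    ```

    Here links_aligned has the same length as values, which is why the singular html_element() is used inside the map call for df_02.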
    

    To get a tidy data frame that could potentially hold more players (one row per player), missing text values are replaced by their URLs and the separate URL column is dropped:

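    df_03 <- df_02 %>%
      # turn labels such as "Größe:" into valid column names (grosse)
      mutate(feature = janitor::make_clean_names(feature),
             # fall back to the URL where there is no text (Social Media)
             text = coalesce(text, url)) %>%
      select(-url) %>%
      # one row per player, one column per feature
      pivot_wider(names_from = feature, values_from = text) %>%
      # split "#4 Mamadou Doucouré" into number and name
      extract(player, into = c("number", "player"), "^#(\\d+) (.*)")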
    
    glimpse(df_03)
    #> Rows: 1
    #> Columns: 16
    #> $ number              <chr> "4"
    #> $ player              <chr> "Mamadou Doucouré"
    #> $ geburtsdatum        <chr> "21.05.1998"
    #> $ geburtsort          <chr> "Dakar"
    #> $ alter               <chr> "24"
    #> $ grosse              <chr> "1,83 m"
    #> $ nationalitat        <chr> "Frankreich Senegal"
    #> $ position            <chr> "Abwehr - Innenverteidiger"
    #> $ fuss                <chr> "links"
    #> $ spielerberater      <chr> "Sport Avenir Management International"
    #> $ aktueller_verein    <chr> "Borussia Mönchengladbach"
    #> $ im_team_seit        <chr> "01.07.2016"
    #> $ vertrag_bis         <chr> "30.06.2024"
    #> $ letzte_verlangerung <chr> "14.02.2020"
    #> $ x2_verein           <chr> "Borussia Mönchengladbach II (#3)"
    #> $ social_media        <chr> "http://www.instagram.com/mams_dcr/"
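
    All columns of df_03 are character. If typed columns are needed, a conversion along the following lines may work. This is only a sketch: player is a hand-built, hypothetical stand-in that copies a few values from the output above, and the code assumes the German formats seen there (dd.mm.yyyy dates, a decimal comma and a trailing " m" for height).

    ```r
    library(dplyr, warn.conflicts = FALSE)

    # Hypothetical stand-in for df_03, values copied from the glimpse() output
    player <- tibble(
      geburtsdatum = "21.05.1998",
      grosse       = "1,83 m",
      vertrag_bis  = "30.06.2024"
    )

    player_typed <- player %>%
      mutate(
        # parse dd.mm.yyyy strings into Date
        across(c(geburtsdatum, vertrag_bis), ~ as.Date(.x, format = "%d.%m.%Y")),
        # "1,83 m" -> 1.83: drop the unit, then swap the decimal comma for a point
        grosse = as.numeric(sub(",", ".", sub(" m$", "", grosse), fixed = TRUE))
      )

    glimpse(player_typed)
    ```

    Values that do not match the assumed formats come back as NA, so it is worth checking for new NAs after the conversion when more players are added.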