
Rvest R not getting inner table


I'm trying to retrieve the medals table from the Wikipedia page for the United States at the 2012 Summer Olympics.

    library(rvest)
    library(magrittr)
    url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"

    xpath0 <- '//*[@id="mw-content-text"]/table[1]'
    xpath1 <- '//*[@id="mw-content-text"]/table[2]'
    xpath2 <- '//*[@id="mw-content-text"]/table[2]/tbody/tr/td[1]'
    xpath3 <- '//*[@id="mw-content-text"]/table[2]/tbody/tr/td[1]/table'

    tb <- url %>%
      html() %>%
      html_nodes(xpath=xpath0) %>%
      html_nodes("") %>%
      html_table()

Using xpath0 or xpath1 returns an error:

Error in parse_simple_selector(stream) : 
  Expected selector, got <EOF at 1>

xpath2 and xpath3 return empty lists.

I also tried using SelectorGadget (https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html) to point to the exact element. It gave me:

//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)] | //*[contains(concat( " ", @class, " " ), concat( " ", "headerSortDown", " " ))]

and the error:

Error in parse_simple_selector(stream) : Expected selector, got

I really appreciate any help.

Joa


Solution

  • The first table, the one with the names, has a complicated structure and seems very difficult to convert into a standard format. At least, I didn't succeed.

    The number of medals by sport, together with the totals, can be obtained with:

    library(rvest) #v.0.2.0.9000
    url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics" 
    tb <- read_html(url) %>% html_node("table.wikitable:nth-child(2)") %>% html_table(fill=TRUE)
    #> head(tb)
    #   Medals by sport   NA   NA   NA    NA NA NA
    #1            Sport 01 ! 02 ! 03 ! Total NA NA
    #2         Swimming   16    9    6    31 NA NA
    #3    Track & field    9   12    7    28 NA NA
    #4       Gymnastics    3    1    2     6 NA NA
    #5         Shooting    3    0    1     4 NA NA
    #6           Tennis    3    0    1     4 NA NA
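
    The output above still carries the real header (`Sport`, `01 !`, ...) in its first data row, plus empty columns that appear to come from `fill=TRUE`. A minimal base-R sketch of one way to tidy that shape follows. `clean_medal_table` is a hypothetical helper, the toy data frame only copies a few values from the output above, and treating every column after the first as a count is an assumption.

    ```r
    # Toy data frame mimicking the shape of head(tb) above
    # (column names X1..X7 are placeholders, values copied from that output).
    tb_toy <- data.frame(
      X1 = c("Sport", "Swimming", "Track & field"),
      X2 = c("01 !", "16", "9"),
      X3 = c("02 !", "9", "12"),
      X4 = c("03 !", "6", "7"),
      X5 = c("Total", "31", "28"),
      X6 = NA,
      X7 = NA,
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: drop all-NA columns and promote the first row to header.
    clean_medal_table <- function(df) {
      # Keep only columns that contain at least one non-NA value.
      keep <- colSums(!is.na(df)) > 0
      df <- df[, keep, drop = FALSE]
      # Use the first data row as column names, then remove that row.
      header <- as.character(unlist(df[1, ]))
      df <- df[-1, , drop = FALSE]
      names(df) <- header
      rownames(df) <- NULL
      # Assumption: every column after the first holds medal counts.
      df[-1] <- lapply(df[-1], as.numeric)
      df
    }

    res <- clean_medal_table(tb_toy)
    print(res)
    ```

    On the real `tb`, the same call would be `clean_medal_table(tb)`, provided the scraped table still has the layout shown above.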
    

    Another table, summarizing all competitors, can be obtained with:

    tb2 <- read_html(url) %>% html_node("table.wikitable:nth-child(20)") %>% html_table()
    #> head(tb2)
    #                        Sport Men Women Total
    #1                     Archery   3     3     6
    #2 Athletics (track and field)  63    62   125
    #3                   Badminton   2     1     3
    #4                  Basketball  12    12    24
    #5                      Boxing   9     3    12
    #6                    Canoeing   5     2     7
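
    Positional selectors such as `nth-child(20)` depend on the page layout at the time of scraping, so they may stop matching if the article is edited. One possible alternative, sketched here on a small inline HTML stand-in rather than the live page, is to read every `table.wikitable` and keep the one whose header contains the expected columns. The HTML snippet and the choice of `Men` and `Women` as identifying columns are assumptions for illustration.

    ```r
    library(rvest)

    # Stand-in for the real page: two small tables with class "wikitable".
    page <- read_html('<html><body>
      <table class="wikitable">
        <tr><th>Sport</th><th>Total</th></tr>
        <tr><td>Swimming</td><td>31</td></tr>
      </table>
      <table class="wikitable">
        <tr><th>Sport</th><th>Men</th><th>Women</th><th>Total</th></tr>
        <tr><td>Archery</td><td>3</td><td>3</td><td>6</td></tr>
        <tr><td>Badminton</td><td>2</td><td>1</td><td>3</td></tr>
      </table>
    </body></html>')

    # Parse every wikitable into a list of data frames.
    tables <- page %>% html_nodes("table.wikitable") %>% html_table()

    # Keep the first table whose header has both "Men" and "Women".
    has_cols <- vapply(
      tables,
      function(x) all(c("Men", "Women") %in% names(x)),
      logical(1)
    )
    competitors <- tables[has_cols][[1]]
    print(competitors)
    ```

    With the real page, `page` would be `read_html(url)`; irregular tables there may still need `fill=TRUE`, as in the calls above.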
    

    And this is the table of multiple medalists:

    tb3 <- read_html(url) %>% html_node("table.wikitable:nth-child(8)") %>% html_table(fill=TRUE)
    #> head(tb3)
    #  Multiple medalists            NA   NA   NA   NA    NA NA
    #1               Name         Sport 01 ! 02 ! 03 ! Total NA
    #2     Michael Phelps      Swimming    4    2    0     6 NA
    #3     Missy Franklin      Swimming    4    0    1     5 NA
    #4    Allison Schmitt      Swimming    3    1    1     5 NA
    #5        Ryan Lochte      Swimming    2    2    1     5 NA
    #6      Allyson Felix Track & field    3    0    0     3 NA
    

    It really depends on which table you want to have, as pointed out by @Metrics. There are many tables on that page.
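
    As for the errors in the question: as far as I can tell, `Expected selector, got <EOF at 1>` comes from the `html_nodes("")` step, since an empty string is not a valid CSS selector. Dropping that step should let the XPath result reach `html_table()` directly. The empty results for `xpath2` and `xpath3` may be caused by `/tbody/`: browsers insert `tbody` into the DOM they display, but it is not necessarily present in the HTML that rvest downloads and parses (an assumption about this page, not something I verified). The sketch below shows both effects on an inline HTML stand-in with a nested table.

    ```r
    library(rvest)

    # Stand-in HTML: a table nested inside another table's cell,
    # with no <tbody> tags in the source.
    page <- read_html('<html><body><div id="mw-content-text">
      <table><tr><td>
        <table>
          <tr><th>Name</th><th>Total</th></tr>
          <tr><td>Michael Phelps</td><td>6</td></tr>
        </table>
      </td></tr></table>
    </div></body></html>')

    # With /tbody/ in the path nothing matches this source.
    with_tbody <- page %>%
      html_nodes(xpath = '//*[@id="mw-content-text"]/table[1]/tbody/tr/td[1]/table')
    print(length(with_tbody))

    # Without /tbody/, and without an empty html_nodes("") step,
    # the inner table is found and converted.
    inner <- page %>%
      html_node(xpath = '//*[@id="mw-content-text"]/table[1]/tr/td[1]/table') %>%
      html_table()
    print(inner)
    ```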