Tags: html, css, r, web-scraping, rvest

How do I choose the correct selector for html_element() with rvest?


I am interested in using rvest to scrape some data and am using the following tutorial as a guide:
https://statsandr.com/blog/web-scraping-in-r/
I understand how to locate the correct table in the browser's HTML source and have replicated the table as shown in the tutorial.

library(rvest)
link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(link)

# scrape F1 driver table
drivers_F1 <- html_element(page, "table.sortable") %>%
  html_table()

From what I can tell, both sortable tables on this page have the same selector. How do I find the correct selector to generate a table using html_element() for the second table ("List of Formula One drivers by country") on this wiki page?


Does html_element() select the first instance of a matching selector by default? When using html_elements(), I am able to view both of these tables:

html_elements(page, "table.sortable")
{xml_nodeset (2)}
[1] <table class="wikitable sortable" style="font-size: 85%; text-align:center">\n<caption>Formula One drivers by name\n</caption>\n<tbody>\n<tr> ...
[2] <table class="wikitable sortable" style="text-align:center; text-align:center; font-size:95%">\n<caption>List of Formula One drivers by count ...

Furthermore, I can generate the second table by selecting the second index element if I instead use the html_elements() function.

html_elements(page, "table.sortable")[[2]] %>%
  html_table()

# A tibble: 42 × 7
   Country          Totaldrivers Champions                                Championships         `Race wins` `First driver(s)` Most recent driver(s…¹
   <chr>            <chr>        <chr>                                    <chr>                 <chr>       <chr>             <chr>                 
 1 Argentinadetails 25           1(Fangio [5])                            5(1951, 1954, 1955, … "38\n(Fang… Juan Manuel Fang… Gastón Mazzacane(2001…
 2 Australiadetails 18           2(Brabham [3], Jones)                    4(1959, 1960, 1966, … "43\n(Brab… Tony Gaze(1952 B… Oscar Piastri, Daniel…
 3 Austriadetails   16           2(Rindt, Lauda [3])                      4(1970, 1975, 1977, … "41\n(Rind… Jochen Rindt(196… Christian Klien(2010 …
 4 Belgiumdetails   24           0                                        0                     "11\n(Ickx… Johnny Claes(195… Stoffel Vandoorne(201…
 5 Brazildetails    32           3(Fittipaldi [2], Piquet [3], Senna [3]) 8(1972, 1974, 1981, … "101\n(Fit… Chico Landi(1951… Pietro Fittipaldi(202…
 6 Canadadetails    15           1(J. Villeneuve)                         1(1997)               "17\n(G. V… Peter Ryan(1961 … Lance Stroll(2023 Abu…
 7 Chile            1            0                                        0                     "0"         Eliseo Salazar(1… Eliseo Salazar(1983 B…
 8 China            1            0                                        0                     "0"         Zhou Guanyu(2022… Zhou Guanyu(2023 Abu …
 9 Colombiadetails  3            0                                        0                     "7\n(Monto… Ricardo Londoño(… Juan Pablo Montoya(20…
10 Czech Republic   1            0                                        0                     "0"         Tomáš Enge(2001 … Tomáš Enge(2001 Japan…
# ℹ 32 more rows
# ℹ abbreviated name: ¹​`Most recent driver(s)/Current driver(s)`
# ℹ Use `print(n = ...)` to see more rows

Using html_elements() seems to get the job done fine in this case, but any insight into why html_element() functions the way it does would be appreciated.


Solution

  • html_elements() returns a nodeset of all matches, whereas html_element() returns a result the same length as the input, i.e. given a single page with multiple matches, it returns only the first match.

    Besides using html_elements() and then picking the desired element from the returned nodeset, you can get the second table with html_element() by being more specific. For example, the pseudo-class :last-of-type matches a table that is the last table among its siblings, and :contains() matches on text, e.g. the table caption:

    library(rvest)
    link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
    page <- read_html(link)
    
    drivers_F1 <- html_element(
      page, 
      "table.sortable:last-of-type"
    ) |>
      html_table()
    
    head(drivers_F1)
    #> # A tibble: 6 × 7
    #>   Country     Totaldrivers Champions Championships `Race wins` `First driver(s)`
    #>   <chr>       <chr>        <chr>     <chr>         <chr>       <chr>            
    #> 1 Argentinad… 25           1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
    #> 2 Australiad… 18           2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
    #> 3 Austriadet… 16           2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
    #> 4 Belgiumdet… 24           0         0             "11\n(Ickx… Johnny Claes(195…
    #> 5 Brazildeta… 32           3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
    #> 6 Canadadeta… 15           1(J. Vil… 1(1997)       "17\n(G. V… Peter Ryan(1961 …
    #> # ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
    
    drivers_F1 <- html_element(
      page,
      "table.sortable:contains('List of Formula One drivers by country')"
    ) |>
      html_table()
    
    head(drivers_F1)
    #> # A tibble: 6 × 7
    #>   Country     Totaldrivers Champions Championships `Race wins` `First driver(s)`
    #>   <chr>       <chr>        <chr>     <chr>         <chr>       <chr>            
    #> 1 Argentinad… 25           1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
    #> 2 Australiad… 18           2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
    #> 3 Austriadet… 16           2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
    #> 4 Belgiumdet… 24           0         0             "11\n(Ickx… Johnny Claes(195…
    #> 5 Brazildeta… 32           3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
    #> 6 Canadadeta… 15           1(J. Vil… 1(1997)       "17\n(G. V… Peter Ryan(1961 …
    #> # ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
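
The difference between the two functions can also be checked offline. The following is a minimal sketch, assuming a small hypothetical two-table page built with minimal_html() in place of the Wikipedia article; the table contents and captions are made up for illustration. It also shows an XPath alternative for selecting a table by its caption text, which is not part of the original answer.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page: two tables with the same classes
page <- minimal_html('
  <table class="wikitable sortable">
    <caption>Drivers by name</caption>
    <tr><th>Driver</th></tr>
    <tr><td>Fangio</td></tr>
  </table>
  <table class="wikitable sortable">
    <caption>Drivers by country</caption>
    <tr><th>Country</th></tr>
    <tr><td>Argentina</td></tr>
  </table>
')

# html_elements() keeps every match: a nodeset of length 2 here
all_tables <- html_elements(page, "table.sortable")
length(all_tables)

# html_element() on a single document keeps only the first match
first_table <- html_element(page, "table.sortable")
html_text2(html_element(first_table, "caption"))

# Applied to a nodeset, html_element() returns one result per input node
captions <- html_text2(html_element(all_tables, "caption"))
captions

# Alternative: pick the table whose caption contains a given text via XPath
by_country <- html_element(
  page,
  xpath = "//table[caption[contains(., 'by country')]]"
)
html_table(by_country)
```

Matching on the caption text is likely to be more robust than relying on position, since an index such as [[2]] or :last-of-type may silently point at a different table if the page layout changes.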