I am interested in using rvest to scrape some data and am using the following tutorial as a guide:
https://statsandr.com/blog/web-scraping-in-r/
I understand how to locate the correct table in the page's HTML in the browser and have replicated the table as shown in the tutorial.
library(rvest)
link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(link)
# scrape F1 driver table
drivers_F1 <- html_element(page, "table.sortable") %>%
  html_table()
From what I can tell, both tables on this page have the same selector. How do I find the correct selector to generate the second table ("List of Formula One drivers by country") on this wiki page using html_element()?
Does html_element() select the first instance of a matching selector by default? When using html_elements(), I am able to view both of these tables:
html_elements(page, "table.sortable")
{xml_nodeset (2)}
[1] <table class="wikitable sortable" style="font-size: 85%; text-align:center">\n<caption>Formula One drivers by name\n</caption>\n<tbody>\n<tr> ...
[2] <table class="wikitable sortable" style="text-align:center; text-align:center; font-size:95%">\n<caption>List of Formula One drivers by count ...
Furthermore, I can generate the second table by selecting the second index element if I instead use the html_elements() function.
html_elements(page, "table.sortable")[[2]] %>%
  html_table()
# A tibble: 42 × 7
Country Totaldrivers Champions Championships `Race wins` `First driver(s)` Most recent driver(s…¹
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Argentinadetails 25 1(Fangio [5]) 5(1951, 1954, 1955, … "38\n(Fang… Juan Manuel Fang… Gastón Mazzacane(2001…
2 Australiadetails 18 2(Brabham [3], Jones) 4(1959, 1960, 1966, … "43\n(Brab… Tony Gaze(1952 B… Oscar Piastri, Daniel…
3 Austriadetails 16 2(Rindt, Lauda [3]) 4(1970, 1975, 1977, … "41\n(Rind… Jochen Rindt(196… Christian Klien(2010 …
4 Belgiumdetails 24 0 0 "11\n(Ickx… Johnny Claes(195… Stoffel Vandoorne(201…
5 Brazildetails 32 3(Fittipaldi [2], Piquet [3], Senna [3]) 8(1972, 1974, 1981, … "101\n(Fit… Chico Landi(1951… Pietro Fittipaldi(202…
6 Canadadetails 15 1(J. Villeneuve) 1(1997) "17\n(G. V… Peter Ryan(1961 … Lance Stroll(2023 Abu…
7 Chile 1 0 0 "0" Eliseo Salazar(1… Eliseo Salazar(1983 B…
8 China 1 0 0 "0" Zhou Guanyu(2022… Zhou Guanyu(2023 Abu …
9 Colombiadetails 3 0 0 "7\n(Monto… Ricardo Londoño(… Juan Pablo Montoya(20…
10 Czech Republic 1 0 0 "0" Tomáš Enge(2001 … Tomáš Enge(2001 Japan…
# ℹ 32 more rows
# ℹ abbreviated name: ¹`Most recent driver(s)/Current driver(s)`
# ℹ Use `print(n = ...)` to see more rows
Using html_elements() seems to get the job done fine in this case, but any insight into why html_element() functions the way it does would be appreciated.
html_elements returns a nodeset of all matches, whereas html_element returns a nodeset the same length as the input, i.e. for a single page with multiple matches it returns only the first match.
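To make the difference concrete, here is a minimal sketch on a small inline document rather than the live Wikipedia page. The two-table HTML and the names doc, first_match, all_matches and captions are made up for illustration; minimal_html() and html_text2() are rvest helpers.

```r
library(rvest)

# Hypothetical two-table document standing in for the Wikipedia page
doc <- minimal_html('
  <table class="sortable"><caption>By name</caption>
    <tr><th>Driver</th></tr><tr><td>A</td></tr></table>
  <table class="sortable"><caption>By country</caption>
    <tr><th>Country</th></tr><tr><td>B</td></tr></table>
')

# One input (the document), so html_element() gives one node: the first match
first_match <- html_element(doc, "table.sortable")

# html_elements() gives every match as a nodeset
all_matches <- html_elements(doc, "table.sortable")
length(all_matches)

# Applied to a nodeset, html_element() returns one result per input node,
# here the caption of each table
captions <- html_text2(html_element(all_matches, "caption"))
captions
```

This is why indexing the result of html_elements() with [[2]] works for the second table, while html_element() on the whole page always stops at the first one.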
Besides using html_elements and then picking the desired element from the returned nodeset, you could get the second table using html_element by being more specific. For example, with the pseudo-class :last-of-type you could select the last table (strictly, a table that is the last <table> among its siblings, which is the case here), or with :contains you could select based on text, e.g. the table caption:
library(rvest)
link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(link)
drivers_F1 <- html_element(
page,
"table.sortable:last-of-type"
) |>
html_table()
head(drivers_F1)
#> # A tibble: 6 × 7
#> Country Totaldrivers Champions Championships `Race wins` `First driver(s)`
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Argentinad… 25 1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
#> 2 Australiad… 18 2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
#> 3 Austriadet… 16 2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
#> 4 Belgiumdet… 24 0 0 "11\n(Ickx… Johnny Claes(195…
#> 5 Brazildeta… 32 3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
#> 6 Canadadeta… 15 1(J. Vil… 1(1997) "17\n(G. V… Peter Ryan(1961 …
#> # ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
drivers_F1 <- html_element(
page,
"table.sortable:contains('List of Formula One drivers by country')"
) |>
html_table()
head(drivers_F1)
#> # A tibble: 6 × 7
#> Country Totaldrivers Champions Championships `Race wins` `First driver(s)`
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Argentinad… 25 1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
#> 2 Australiad… 18 2(Brabha… 4(1959, 1960… "43\n(Brab… Tony Gaze(1952 B…
#> 3 Austriadet… 16 2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
#> 4 Belgiumdet… 24 0 0 "11\n(Ickx… Johnny Claes(195…
#> 5 Brazildeta… 32 3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
#> 6 Canadadeta… 15 1(J. Vil… 1(1997) "17\n(G. V… Peter Ryan(1961 …
#> # ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
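As a further option, the same caption-based selection can be written as an XPath expression, which avoids relying on the non-standard :contains pseudo-class. This is only a sketch on a made-up inline document (the HTML, captions and the name by_country are assumptions for illustration); against the real page you would pass page instead of doc and match on the actual caption text.

```r
library(rvest)

# Hypothetical stand-in for the page: two tables sharing the same class
doc <- minimal_html('
  <table class="sortable"><caption>Drivers by name</caption>
    <tr><th>Driver</th></tr><tr><td>Fangio</td></tr></table>
  <table class="sortable"><caption>Drivers by country</caption>
    <tr><th>Country</th></tr><tr><td>Argentina</td></tr></table>
')

# Select the sortable table whose <caption> contains the given text
by_country <- html_element(
  doc,
  xpath = "//table[contains(@class, 'sortable')][caption[contains(., 'by country')]]"
) |>
  html_table()

by_country
```

Matching on the caption tends to be more robust than relying on position, since it keeps working if tables are added to or removed from the page.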