I want to scrape the following webpage (scraping it is allowed):
https://www.bisafans.de/pokedex/listen/numerisch.php
The aim is to extract a table like the following:
number | name | type1 | type2 |
---|---|---|---|
001 | Bisasam | Pflanze | Gift |
002 | ... | ... | ... |
I was able to scrape the number and name columns, but I have trouble extracting the types because they are only present as the alt text of an image:
<img src="https://media.bisafans.de/f630aa6/typen/pflanze.png" alt="Pflanze">
How can I extract the value of the alt attribute? I already tried extracting the whole table, which only returns the numbers and names. Another approach was html_attr(), but that didn't work either.
Does someone know how I can achieve this?
This is nice and easy with the right CSS selector lists, processing the data as a list of table rows and building one data.frame() per row inside map_dfr().
Inside data.frame() you can leverage the fact that NA is returned when a CSS selector list does not match anything in the DOM, which ensures equal column lengths. Specify a selector list for each possible column entry.
library(tidyverse)
library(rvest)

# One node per table row
rows <- read_html("https://www.bisafans.de/pokedex/listen/numerisch.php") %>%
  html_elements(".table tbody tr")

# Build a one-row data.frame per table row and bind them together
df <- map_dfr(rows, ~ data.frame(
  `Nr.` = .x %>% html_element("td:first-child") %>% html_text(),
  `Pokémon` = .x %>% html_element("a") %>% html_text(),
  # The types are the alt attributes of the images in the last cell;
  # html_attr() gives NA for Type2 when there is no second type link
  `Type1` = .x %>% html_element("td:last-child > a:nth-child(odd) > img") %>% html_attr("alt"),
  `Type2` = .x %>% html_element("td:last-child > a:nth-child(even) > img") %>% html_attr("alt")
))
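The NA behaviour this relies on can be checked offline on a small inline document. The following is only a minimal sketch: the markup is a hypothetical two-row table that assumes the same structure as the selectors above (types as img elements inside links in the last cell) and is not copied from the real page.

```r
library(rvest)

# Hypothetical markup (an assumption for illustration, not the real page):
# the first row has two type images, the second row only one
page <- minimal_html('
  <table class="table"><tbody>
    <tr>
      <td>001</td>
      <td><a href="#">Bisasam</a></td>
      <td><a href="#"><img src="pflanze.png" alt="Pflanze"></a><a href="#"><img src="gift.png" alt="Gift"></a></td>
    </tr>
    <tr>
      <td>004</td>
      <td><a href="#">Glumanda</a></td>
      <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td>
    </tr>
  </tbody></table>')

rows <- html_elements(page, ".table tbody tr")

# html_element() returns exactly one result per row, using a "missing"
# placeholder where nothing matches; html_attr() turns that into NA
type1 <- rows %>%
  html_element("td:last-child > a:nth-child(odd) > img") %>%
  html_attr("alt")
type2 <- rows %>%
  html_element("td:last-child > a:nth-child(even) > img") %>%
  html_attr("alt")

print(type1)
print(type2)
```

Because both vectors always have one entry per row, they can be combined into columns of equal length even when some rows have no second type.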