r, web-scraping, rvest

Scraping values from element attributes with rvest in R


I want to scrape the following webpage (scraping it is permitted):

https://www.bisafans.de/pokedex/listen/numerisch.php

The aim is to extract a table like the following:

number name type1 type2
001 Bisasam Pflanze Gift
002 ... ... ...

I was able to scrape the number and name columns, but I am having trouble extracting the types, because they are not plain text: each type appears only in the alt attribute of an image:

<img src="https://media.bisafans.de/f630aa6/typen/pflanze.png" alt="Pflanze">

How can I extract the value of the alt attribute? I already tried extracting the whole table, but that only returns the numbers and names. I also tried html_attr(), but that did not work either.

Does anyone know how I can achieve this?


Solution

  • This is straightforward with the right CSS selector lists: treat the table as a list of rows and build a one-row data.frame() for each row inside map_dfr().

    Inside data.frame(), you can rely on the fact that html_element() returns a missing node when a selector matches nothing in the row, and that html_text() and html_attr() turn that missing node into NA. This keeps every column the same length. Specify one selector list for each possible column entry.

    library(tidyverse)
    library(rvest)
    
    # Each <tr> in the table body is one Pokémon
    rows <- read_html("https://www.bisafans.de/pokedex/listen/numerisch.php") %>% html_elements(".table tbody tr")
    
    # Build a one-row data frame per table row; map_dfr() row-binds them
    df <- map_dfr(rows, ~ data.frame(
      `Nr.` = .x %>% html_element("td:first-child") %>% html_text(),
      `Pokémon` = .x %>% html_element("a") %>% html_text(),
      # Types exist only as the alt text of images in the last cell;
      # a missing second type yields NA
      `Type1` = .x %>% html_element("td:last-child > a:nth-child(odd) > img") %>% html_attr("alt"),
      `Type2` = .x %>% html_element("td:last-child > a:nth-child(even) > img") %>% html_attr("alt")
    ))
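
To see the NA behaviour without hitting the live site, here is a minimal offline sketch of the same approach. The two-row HTML fragment is a made-up stand-in for the real page, whose markup may differ, and the column names number, name, type1 and type2 are taken from the table in the question rather than from the page. It assumes rvest 1.0.0 or later (for minimal_html() and html_elements()) and that dplyr is installed, which map_dfr() needs for row binding.

```r
library(rvest)
library(purrr)

# Hypothetical fragment imitating the table structure described above;
# the real page's markup is an assumption here.
page <- minimal_html('
  <table class="table">
    <tbody>
      <tr>
        <td>001</td>
        <td><a href="#">Bisasam</a></td>
        <td><a href="#"><img src="pflanze.png" alt="Pflanze"></a><a href="#"><img src="gift.png" alt="Gift"></a></td>
      </tr>
      <tr>
        <td>004</td>
        <td><a href="#">Glumanda</a></td>
        <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td>
      </tr>
    </tbody>
  </table>
')

# One node per table row
rows <- html_elements(page, ".table tbody tr")

# One single-row data frame per <tr>, row-bound by map_dfr()
df <- map_dfr(rows, ~ data.frame(
  number = html_text(html_element(.x, "td:first-child")),
  name   = html_text(html_element(.x, "a")),
  # First and second link in the last cell; no match gives NA
  type1  = html_attr(html_element(.x, "td:last-child > a:nth-child(odd) > img"), "alt"),
  type2  = html_attr(html_element(.x, "td:last-child > a:nth-child(even) > img"), "alt")
))

print(df)
```

The second row has only one type image, so its type2 entry comes back as NA instead of shortening the column. That is what lets data.frame() build a complete row for every Pokémon.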