How to select "href" of a web page of a specific "target"?


<a class="image teaser-image ng-star-inserted" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">

I just want to extract the "href" value (as in the HTML tag above), prepend the site's domain "https://kurier.at" to it, and then scrape every article linked from the home page.

I tried the following code:

library(rvest)
library(lubridate)


kurier_wbpg <- read_html("https://kurier.at")

# I just want the "a" tags whose "target" attribute is "_self"

articleLinks <- kurier_wbpg %>% html_elements("a")%>%
html_elements(css = "tag[attribute=_self]")  %>% 
html_attr("href")%>% 
paste("https://kurier.at",.,sep = "")

When I run the code block above up to the html_attr("href") step, the result I get is

character(0)

I think something is wrong with how I select the HTML element. Could you help me with this?


Solution

  • You need to narrow your CSS selector to the image link in the second teaser block, which you can do by using the naming conventions of the classes. You can then use url_absolute() to prepend the domain.

    library(rvest)
    library(magrittr)
    
    url <- 'https://kurier.at/'
    result <- read_html(url) %>% 
      html_element('.teasers-2 .image') %>% 
      html_attr('href') %>% 
      url_absolute(url)
    

    The same principle gets all teasers:

    results <- read_html(url) %>% 
      html_elements('.teaser .image') %>% 
      html_attr('href') %>% 
      url_absolute(url)
    

    I'm not sure whether you want the bottom block of five included. If so, you can again select by class:

    articles <- read_html(url) %>% 
      html_elements('.teaser-title') %>% 
      html_attr('href') %>% 
      url_absolute(url)
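
As for why the original attempt returned character(0): the selector "tag[attribute=_self]" is taken literally, so it looks for elements named "tag" with an attribute named "attribute", and chaining it after html_elements("a") searches only inside each link. If you do want to filter on the target attribute itself, a CSS attribute selector such as a[target='_self'] is one option. The following is a minimal offline sketch, not a tested scraper for the live site: the `page` object is a hypothetical stand-in built with minimal_html() from the tag in the question plus one made-up external link, and the live page's markup may differ.

```r
library(rvest)

# Hypothetical stand-in for the home page: the tag from the question
# plus one invented link with a different target value.
page <- minimal_html('
  <a class="image teaser-image" target="_self" href="/politik/inland/neuwahlen-2022-welche-szenarien-jetzt-realistisch-sind/401773131">Teaser</a>
  <a class="image" target="_blank" href="https://example.com/elsewhere">External</a>
')

# The selector from the question matches nothing: there is no <tag> element,
# and the second html_elements() call only looks inside each <a>.
no_match <- page %>%
  html_elements("a") %>%
  html_elements(css = "tag[attribute=_self]") %>%
  html_attr("href")

# Attribute selector: <a> elements whose target attribute equals "_self".
article_links <- page %>%
  html_elements("a[target='_self']") %>%
  html_attr("href") %>%
  url_absolute("https://kurier.at/")

print(length(no_match))
print(article_links)
```

On the real page you would replace `page` with read_html("https://kurier.at/"). Note that many links on a page can share target="_self", so the class-based selectors above may still be the more precise way to pick out only the article teasers.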