Tags: html, r, web-scraping, rvest

Obtaining the first link of a search result with the `rvest` package in R


NOTE: this question does not violate Google's Terms of Service. The use case and purposes are compliant and within the ToS.

The object url is an atomic character vector that stores a URL:

url <- "https://www.google.com/search?q=Kendrick+Lamar+aoty"

I need to obtain the first non-ad link that Google's search results return and store it as the object first_link. Here's what I do now:

library(rvest)

url <- "https://www.google.com/search?q=Kendrick+Lamar+aoty"

search_page <- read_html(url)

first_link <- search_page %>%
  html_nodes(".g:not(.g .adsbygoogle)") %>%
  html_node("a") %>%
  html_attr("href") %>%
  URLdecode()

I get the following error:

Error in parse_simple_selector(stream) : Expected ')', got .

The issue appears to be in my call to the html_nodes() function. I have modified the CSS selector in several ways but have not been able to resolve the error. Based on the example above, the result I expect to store in the first_link object is:

https://www.albumoftheyear.org/artist/1881-kendrick-lamar/

What am I doing wrong in the html_nodes() call, or what else am I missing to get the desired output shown above?


Solution

  • Does this solution work for you? It collects every link on the page and keeps those whose href contains url?q=:

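    library(rvest)

    url <- "https://www.google.com/search?q=Kendrick+Lamar+aoty"
    search_page <- read_html(url)

    # Collect the href attribute of every anchor on the page
    links <- search_page |>
      html_elements("a") |>
      html_attr("href")

    # Keep only hrefs containing "url?q="; which() drops the NA values
    # produced by anchors that have no href attribute
    i <- which(stringr::str_detect(links, "url\\?q\\="))
    # Extract the target URL, from "https" up to the first "&"
    out <- stringr::str_extract(links[i], "https.*?(?=\\&)")

    # All extracted links; the first one is out[1]
    out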
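
To check the extraction step without hitting Google, here is a minimal offline sketch. The `sample_html` string is a hypothetical stand-in for a results page, and the `/url?q=<target>&...` wrapper is an assumption taken from the pattern the answer above relies on; the markup Google actually serves may differ or change.

```r
library(rvest)

# Hypothetical stand-in for a results page (not real Google markup),
# so the sketch runs offline.
sample_html <- paste0(
  '<html><body>',
  '<div class="g"><a href="/url?q=https://www.albumoftheyear.org/artist/1881-kendrick-lamar/&amp;sa=U&amp;ved=abc">AOTY</a></div>',
  '<div class="g"><a href="/search?q=related">Related</a></div>',
  '<div class="g"><a>No href</a></div>',
  '<div class="g"><a href="/url?q=https://example.com/page&amp;sa=U">Example</a></div>',
  '</body></html>'
)

page <- read_html(sample_html)

# href of every anchor; anchors without an href yield NA
hrefs <- page |>
  html_elements("a") |>
  html_attr("href")

# Keep only hrefs that start with the assumed "/url?q=" wrapper
# (the !is.na() guard skips anchors without an href)
is_result <- !is.na(hrefs) & grepl("^/url\\?q=", hrefs)

# Capture everything between "q=" and the first "&"
targets <- sub("^/url\\?q=([^&]*).*$", "\\1", hrefs[is_result])

# Decode percent-escapes in the first match only
first_link <- utils::URLdecode(targets[1])
first_link
```

As for the original error: rvest translates CSS selectors with the selectr package, which, as far as I know, follows CSS Selectors Level 3, where `:not()` accepts only a simple selector. A descendant selector such as `.g .adsbygoogle` inside `:not()` is therefore a likely cause of the `Expected ')'` parse failure, whereas a single class such as `:not(.adsbygoogle)` should at least parse.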