NOTE: this question does not violate Google's Terms of Service. The use case and purposes are compliant and within the ToS.
The object url is an atomic character vector that stores a URL:
url <- "https://www.google.com/search?q=Kendrick+Lamar+aoty"
I need to obtain the first non-ad link that Google's search results return and store it in the object first_link. Here's what I do now:
library(rvest)
url <- "https://www.google.com/search?q=Kendrick+Lamar+aoty"
search_page <- read_html(url)
first_link <- search_page %>%
  html_nodes(".g:not(.g .adsbygoogle)") %>%
  html_node("a") %>%
  html_attr("href") %>%
  URLdecode()
I get the following error:
Error in parse_simple_selector(stream) : Expected ')', got .
The issue appears to be in my call to html_nodes(). I have modified the CSS selector in several ways but have not managed to resolve the error. The result I expect to store in first_link, based on the example above, is:
https://www.albumoftheyear.org/artist/1881-kendrick-lamar/
What am I doing wrong with html_nodes() (or is there something else I'm missing to get the desired output shown above)?
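On the error itself: rvest translates CSS selectors through the selectr package, which most likely follows the CSS3 rule that :not() accepts only a simple selector. `.g .adsbygoogle` contains a descendant combinator (the space), so the parser stops at the second `.` and reports that it expected `)`. A minimal sketch of a two-step filter that avoids :not() altogether is shown below. The HTML is hypothetical stand-in markup, and the `.g` and `.adsbygoogle` class names are taken from the question rather than verified against Google's current page.

```r
library(rvest)

# Hypothetical markup standing in for a results page; real Google HTML differs.
page <- read_html('
  <div class="g"><div class="adsbygoogle"><a href="https://ads.example.com/">Ad</a></div></div>
  <div class="g"><a href="https://www.albumoftheyear.org/artist/1881-kendrick-lamar/">Result</a></div>
')

# Step 1: select every result block with a plain class selector.
blocks <- html_elements(page, ".g")

# Step 2: flag blocks that contain an .adsbygoogle descendant.
is_ad <- vapply(
  blocks,
  function(b) length(html_elements(b, ".adsbygoogle")) > 0,
  logical(1)
)

# Step 3: take the first anchor of each remaining block and keep the first href.
hrefs <- blocks[!is_ad] |>
  html_element("a") |>
  html_attr("href")
first_link <- hrefs[1]
first_link
```

Splitting the selection and the exclusion keeps each selector simple enough for the parser, at the cost of one extra pass over the matched blocks.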
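Does this approach work for you? Instead of filtering on CSS classes, it collects every link on the page and keeps only those wrapped in a url?q= redirect: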
library(rvest)
url <- "https://www.google.com/search?q=Kendrick+Lamar+aoty"
search_page <- read_html(url)
links <- search_page |>
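  html_elements("a") |>
  html_attr("href")
# Flag hrefs that contain a "url?q=<target>" redirect
i <- stringr::str_detect(links, "url\\?q\\=")
# which() drops the NA entries produced by anchors without an href;
# the lazy match keeps the target URL up to the first "&"
out <- stringr::str_extract(links[which(i)], "https.*?(?=\\&)")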
out
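To see what the two regular expressions do without hitting the network, here is a small sketch run on hand-written hrefs. The sample strings (including the `sa` and `ved` parameters) are assumptions about the shape of the redirect links, not captured output, and first_link is simply the first extracted target passed through URLdecode() as in the question.

```r
library(stringr)

# Hypothetical hrefs shaped like "/url?q=<target>&..." redirects, plus
# an NA (anchor without href) and an internal link that should be skipped.
links <- c(
  NA,
  "/search?q=Kendrick+Lamar+aoty&tbm=isch",
  "/url?q=https://www.albumoftheyear.org/artist/1881-kendrick-lamar/&sa=U&ved=abc",
  "/url?q=https://en.wikipedia.org/wiki/Kendrick_Lamar&sa=U&ved=def"
)

# TRUE for redirect links, FALSE for other links, NA for missing hrefs.
is_redirect <- str_detect(links, "url\\?q\\=")

# which() keeps only the TRUE positions; the lazy match stops at the first "&".
targets <- str_extract(links[which(is_redirect)], "https.*?(?=\\&)")

# The first extracted target, percent-decoded.
first_link <- URLdecode(targets[1])
first_link
```

Note that this keeps every redirect link, so if the page also wraps ads or Google's own pages the same way, an extra filter on the domain would be needed before taking the first element.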