I am using rvest
to get the hyperlinks in a Google search. User @AllanCameron helped me in the past to sketch this code but now I do not know how to change the xpath or what I need to do in order to get the links. Here my code:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
links <- html_nodes(first_page, xpath = "//div/div/a/h3") %>%
html_attr('href')
Which entirely returns NA
.
I would like to get the links for each item that appears like next (sorry for the quality of images):
Is possible to get that stored in a dataframe? Many thanks!
Look at the parents a
of the h3
nodes and find their href
attribute. This ensures you have the same number of links as the main titles, to allow for easy arrangement in a dataframe.
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
html_elements(xpath = "./parent::a") %>%
html_attr("href") %>%
str_extract("https.*?(?=&)")
[1] "https://www.linkedin.com/in/mario-torres-b5796315b"
[2] "https://mariolopeztorres.com/"
[3] "https://www.instagram.com/mario_torres25/%3Fhl%3Den"
[4] "https://www.1stdibs.com/buy/mario-torres-lopez/"
[5] "https://m.facebook.com/2064681987175832"
[6] "https://www.facebook.com/mariotorresmx"
[7] "https://www.transfermarkt.us/mario-torres/profil/spieler/28167"
[8] "https://en.wikipedia.org/wiki/Mario_Garc%25C3%25ADa_Torres"
[9] "https://circawho.com/press-and-magazines/mario-lopez-torress-legacy-is-still-being-woven-in-michoacan-mexico/"