I am trying to scrape text from a news website. The search results are paginated, which I usually handle with rvest's follow_link. However, in this case I keep getting page 1 back instead of page 2, page 3, etc.
Any idea why this is happening?
library(tidyverse)
library(rvest)
library(httr)
url = "https://www.milenio.com"
UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
MySession = html_session(
  url = url,
  user_agent(UserAgent)
)
page = MySession %>%
  jump_to(url = 'buscador/page/2?text=violencia')
page
page2 = page %>%
  follow_link(css = ".number-pages-container span:nth-child(2) a")
page2
I added some extra headers and requested the pages in the order a browser would: the search page, then the page with the query string, then the page 2 link. I did this because I believe the site requires a certain sequence of cookies to be set.
library(tidyverse)
library(rvest)
library(httr)
# Start the session on the search page so its cookies are set first
MySession <- session(
  url = 'https://www.milenio.com/buscador',
  add_headers(
    'accept-language' = 'en-GB,en-US;q=0.9,en;q=0.8',
    'user-agent' = "Mozilla/5.0",
    'referer' = 'https://www.milenio.com/buscador'
  )
)
# Run the search, then follow the page 2 link from the results
page <- MySession %>%
  session_jump_to(url = '/buscador?text=violencia')
page2 <- page %>%
  session_follow_link(css = ".number-pages-container span:nth-child(2) a")
# Read the .headline-number element from the returned page
page2 %>% html_element('.headline-number') %>% html_text()
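If you need more than page 2, the same idea can be wrapped in a loop that keeps following the pagination links within one session. This is only a rough sketch: `page_link_selector()` and `scrape_pages()` are hypothetical helpers of mine, and I am assuming that the n-th `span` inside `.number-pages-container` holds the link to page n, which I have not verified beyond page 2. It needs rvest 1.0 or later and httr installed.

```r
# Hypothetical helper: CSS selector for the link to page n.
# ASSUMPTION: the n-th span in .number-pages-container links to page n
# (only n = 2 is taken from the code above).
page_link_selector <- function(n) {
  sprintf(".number-pages-container span:nth-child(%d) a", n)
}

# Hypothetical helper: run the search, then walk through pages 1..n_pages
# in order, reusing one session so cookies carry over between requests.
scrape_pages <- function(query, n_pages) {
  sess <- rvest::session(
    "https://www.milenio.com/buscador",
    httr::add_headers(
      "accept-language" = "en-GB,en-US;q=0.9,en;q=0.8",
      "user-agent" = "Mozilla/5.0",
      "referer" = "https://www.milenio.com/buscador"
    )
  )
  # Page 1 of the results: the search page with the query string
  sess <- rvest::session_jump_to(
    sess,
    paste0("/buscador?text=", utils::URLencode(query, reserved = TRUE))
  )

  results <- character(n_pages)
  for (n in seq_len(n_pages)) {
    if (n > 1) {
      # Follow the pagination link from the page we are currently on
      sess <- rvest::session_follow_link(sess, css = page_link_selector(n))
    }
    # Same element the code above reads from each returned page
    results[n] <- rvest::html_text(
      rvest::html_element(sess, ".headline-number")
    )
  }
  results
}
```

You would call it as `scrape_pages("violencia", 3)`. If the pagination markup shifts after the first few pages (for example, a "previous" link is added), the selector will need adjusting.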