r rvest httr

rvest follow_link brings me back to the same page


I am trying to get text from a news website. The search brings me to a paginated sequence of results, which I usually handle with rvest's follow_link. However, in this case I keep getting Page 1 back instead of page 2, page 3, etc.

Any idea why this is happening?

library(tidyverse)
library(rvest)         
library(httr)

url = "https://www.milenio.com"
UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
MySession = html_session(
                        url = url,
                        user_agent(UserAgent)
                        )

page = MySession %>%
  jump_to(url = 'buscador/page/2?text=violencia')

page

page2 = page %>% 
  follow_link(css = ".number-pages-container span:nth-child(2) a")

page2

Solution

  • I added some additional headers and followed the sequence: search page > page with query string > page 2 link. I did this because I believe the site requires cookies to be set in a particular sequence.

    library(tidyverse)
    library(rvest)         
    library(httr)
    
    url = "https://www.milenio.com"
    
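    # session() is the current name for html_session() in rvest >= 1.0,
    # matching the session_*() functions used below
    MySession = session(
      url = 'https://www.milenio.com/buscador',
      add_headers('accept-language' = 'en-GB,en-US;q=0.9,en;q=0.8',
                  'user-agent' = "Mozilla/5.0",
                  'referer' = 'https://www.milenio.com/buscador'  # no trailing comma: an empty argument can raise an error
      )
    )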
    
    page <- MySession %>%
      session_jump_to(url = '/buscador?text=violencia')
    
    page2 <- page %>% 
      session_follow_link(css = ".number-pages-container span:nth-child(2) a")
    
    page2 %>% html_element('.headline-number') %>% html_text()
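
When a followed link does not land where you expect, it can help to check which href the CSS selector actually matches before following it. The sketch below does this offline. The pagination markup is hypothetical (it is not copied from milenio.com and the real page may be structured differently); only the selector and the URL pattern are taken from the question.

```r
library(rvest)

# Hypothetical pagination markup, NOT the site's real HTML. It only
# illustrates how the selector from the question picks a single link.
html <- read_html('
<div class="number-pages-container">
  <span><a href="/buscador?text=violencia">1</a></span>
  <span><a href="/buscador/page/2?text=violencia">2</a></span>
  <span><a href="/buscador/page/3?text=violencia">3</a></span>
</div>')

# The href of the first element matched by the selector
href <- html %>%
  html_element(".number-pages-container span:nth-child(2) a") %>%
  html_attr("href")

# Resolve the relative href against the site root to see the full target URL
target <- xml2::url_absolute(href, "https://www.milenio.com")

print(href)
print(target)
```

On the live site, the same two calls (html_element() then html_attr("href")) can be run on the session object in place of the literal HTML. If the printed href already points at page 1, the selector is the likely problem rather than the cookies.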