Tags: r, web-scraping, rvest, httr

Change user agent when using rvest::read_html


I can change the user agent with the httr package and create a session that uses it. However, I am not sure how to make the read_html function fetch the HTML document with that user agent.

I have seen the bug report here, but it is still not clear to me how to make this work: once the session has been created, how do I use it with the read_html function?

As an example, this is how I set the user agent:

library(rvest)
library(httr)  # user_agent() comes from httr

link = "https://www.bbc.com/"

my_session = session(link)
my_session$response$request$options$useragent

user_agent_new = user_agent("Test User 1")

my_session2 = session(link, user_agent_new)
my_session2$response$request$options$useragent

How do you set the user agent in the rvest::read_html call?


Solution

  • Note: rvest builds on httr, and read_html (from xml2) accepts httr response objects, so I'll use httr in my answer here.

    As you note in your post, dynamically setting the User Agent is very straightforward when using the httr package. As an example I'll use the link you listed above:

    library(httr)
    
    # Let's set user agent to a super common one
    ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
    
    # Query webpage
    bbc <- GET("https://www.bbc.com/",
               user_agent(ua))
    
    # Confirm it's actually used the desired user agent
    bbc$request$options$useragent
    #> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
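
    To connect this back to read_html: the response returned by GET can be passed to read_html directly, so the page is parsed from the request that carried your user agent. The following is a minimal sketch, assuming a recent rvest (1.0 or later, for html_element and html_text2) and network access; the names resp and page are illustrative, and "Test User 1" is the string from the question.

    ```r
    library(httr)
    library(rvest)

    # User agent string taken from the question; replace with your own
    ua <- "Test User 1"

    # Fetch the page with httr so the custom user agent is sent
    resp <- GET("https://www.bbc.com/", user_agent(ua))

    # Stop early if the server returned an HTTP error status
    stop_for_status(resp)

    # read_html() has a method for httr responses and parses the body
    page <- read_html(resp)

    # Work with the parsed document as usual, e.g. extract the page title
    page_title <- html_text2(html_element(page, "title"))
    ```

    Note that calling read_html with the URL string would make a new request of its own, so the response object is what ties the parsed document to the custom user agent.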
    

    For comparison, here is the User Agent value that httr sends by default:

    library(httr)
    
    # Query webpage with default user agent
    bbc <- GET("https://www.bbc.com/")
    
    # Print default user agent value
    bbc$request$options$useragent
    #> [1] "libcurl/7.64.1 r-curl/4.3 httr/1.4.2"
    

    Obviously, you can set the User Agent to whatever you want. Here is a list of common User Agents.
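
    If you would rather keep the session from your question, the same idea should apply there: a session created with a user agent can be handed to read_html. This is a sketch under the assumption that you are on rvest 1.0 or later, where read_html has a method for session objects; my_session and page are illustrative names.

    ```r
    library(rvest)
    library(httr)  # provides user_agent()

    link <- "https://www.bbc.com/"

    # Create the session with the custom user agent, as in the question
    my_session <- session(link, user_agent("Test User 1"))

    # Check which user agent the session's request used
    session_ua <- my_session$response$request$options$useragent

    # Parse the page that the session has already fetched
    page <- read_html(my_session)
    ```

    Because the session object carries the user agent configuration, it is the object to pass along, rather than the bare URL.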