I am able to change the user agent using the httr package and create a session with the new user agent. However, I am not sure how to use this new user agent with the read_html function to get the HTML document using the defined user agent.
I have seen the bug report here, but unfortunately it's still not clear to me how to get this to work once you create a session and then have to use the read_html function.
As an example of changing the user agent, I have the following:
library(rvest)
library(httr)  # provides user_agent()
link = "https://www.bbc.com/"
my_session = session(link)
my_session$response$request$options$useragent
user_agent_new = user_agent("Test User 1")
my_session2 = session(link, user_agent_new)
my_session2$response$request$options$useragent
How do you set the user agent in the rvest::read_html call?
Note: rvest uses httr under the hood, and xml2 can parse httr responses directly, so I'll introduce httr in my answer here.
As you note in your post, dynamically setting the User Agent is very straightforward when using the httr package. As an example, I'll use the link you listed above:
library(httr)
# Let's set user agent to a super common one
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
# Query webpage
bbc <- GET("https://www.bbc.com/", user_agent(ua))
# Confirm it's actually used the desired user agent
bbc$request$options$useragent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
Now you can compare the User Agent value when using the httr defaults:
library(httr)
# Query webpage with default user agent
bbc <- GET("https://www.bbc.com/")
# Print default user agent value
bbc$request$options$useragent
#> [1] "libcurl/7.64.1 r-curl/4.3 httr/1.4.2"
Obviously, you can set the User Agent to whatever you want. Here is a list of common User Agents.
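To tie this back to read_html, here is a minimal sketch of one way to do it, assuming rvest 1.0 or later (for session()) with httr installed. The helper name read_html_with_ua is hypothetical and not part of any package. It relies on xml2 providing a read_html() method for httr response objects, so the response from GET() can be parsed directly.

```r
library(httr)
library(rvest)  # attaches read_html(); the parsing itself is done by xml2

# Hypothetical helper (not part of httr or rvest): fetch a page with a
# custom user agent and parse the response body as HTML.
read_html_with_ua <- function(url, ua) {
  resp <- GET(url, user_agent(ua))
  stop_for_status(resp)  # fail early on HTTP errors
  read_html(resp)        # read_html() accepts an httr response object
}

# Usage (requires network access):
# page <- read_html_with_ua("https://www.bbc.com/", "Test User 1")
# html_text2(html_element(page, "title"))

# With an rvest session, the user agent passed at creation is kept for
# the session, and the session itself can be passed to read_html():
# my_session2 <- session("https://www.bbc.com/", user_agent("Test User 1"))
# page <- read_html(my_session2)
```

Either route avoids calling read_html on the bare URL, which would fetch the page again without the custom user agent.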