Some time ago I found this R code, edited by Greg (here), and it worked very well for a long time. Unfortunately, it recently stopped working (at least for me), and I am wondering whether someone could help fix the problem, if that is possible.
library(rvest)
webpage <- read_html("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
html <- rvest::html_nodes(webpage, "thead+ thead th , #style-1 td")
results <- rvest::html_text(html)
Date <- results[seq(5, length(results), 4)]
`Stock Price` <- results[seq(6, length(results), 4)]
`TTM Net EPS` <- results[seq(7, length(results), 4)]
`PE Ratio` <- results[seq(8, length(results), 4)]
results <- data.frame(Date, `Stock Price`, `TTM Net EPS`, `PE Ratio`, stringsAsFactors = FALSE)
Returning
head(results)
        Date Stock.Price TTM.Net.EPS PE.Ratio
1 2020-04-27     1060.52                16.25
2 2020-02-29     1032.51      $65.27    15.82
3 2019-11-30     1177.92      $64.37    18.30
4 2019-08-31     1101.69      $63.54    17.34
5 2019-05-31     1027.11      $55.97    18.35
6 2019-02-28      938.97      $53.40    17.58
But as I said, sadly it no longer seems to work. If somebody could help, it would be very nice for the R world.
At least in R & RStudio running on Windows, the User-Agent request header sent by rvest::read_html() is something like:
RStudio Desktop (2023.6.2.529); R (4.2.3 x86_64-w64-mingw32 x86_64 mingw32)
Apparently they really dislike the RStudio Desktop part. Here is a demo with httr2, used for convenience.
Set User-Agent to something that includes RStudio Desktop and the request will fail:
library(httr2)
url_ <- "https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio"
request(url_) |>
req_user_agent("RStudio Desktop") |>
req_perform(verbosity = 1)
#> -> GET /stocks/charts/azo/autozone/pe-ratio HTTP/1.1
#> -> Host: www.macrotrends.net
#> -> User-Agent: RStudio Desktop
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> ->
#> <- HTTP/1.1 403 Forbidden
#> <- Connection: close
#> <- Content-Length: 420
#> <- Server: Varnish
#> <- Retry-After: 0
#> <- Content-Type: text/html; charset=utf-8
#> <- Accept-Ranges: bytes
#> <- Date: Fri, 22 Sep 2023 13:16:06 GMT
#> <- Via: 1.1 varnish
#> <- X-Served-By: cache-hel1410024-HEL
#> <- X-Cache: MISS
#> <- X-Cache-Hits: 0
#> <- X-Timer: S1695388567.783692,VS0,VE0
#> <-
#> Error in `req_perform()`:
#> ! HTTP 403 Forbidden.
#> Backtrace:
#> ▆
#> 1. └─httr2::req_perform(req_user_agent(request(url_), "RStudio Desktop"), verbosity = 1)
#> 2. └─httr2:::resp_abort(resp, error_body(req, resp), call = error_call)
#> 3. └─rlang::abort(...)
Change one character in User-Agent and we are worthy of a status 200 and content:
request(url_) |>
req_user_agent("RStudio.Desktop") |>
req_perform(verbosity = 1)
#> -> GET /stocks/charts/azo/autozone/pe-ratio HTTP/1.1
#> -> Host: www.macrotrends.net
#> -> User-Agent: RStudio.Desktop
#> -> Accept: */*
#> -> Accept-Encoding: deflate, gzip
#> ->
#> <- HTTP/1.1 200 OK
#> <- Connection: keep-alive
#> <- Content-Length: 14962
#> <- Server: Apache/2.4.18 (Ubuntu)
#> <- Cache-Control: no-cache, no-store, must-revalidate
#> <- Pragma: no-cache
#> <- Expires: 0
#> <- Content-Encoding: gzip
#> <- Content-Type: text/html; charset=UTF-8
#> <- Via: 1.1 varnish, 1.1 varnish
#> <- Accept-Ranges: bytes
#> <- Date: Fri, 22 Sep 2023 13:16:07 GMT
#> <- Age: 1763
#> <- X-Served-By: cache-iad-kjyo7100051-IAD, cache-hel1410025-HEL
#> <- X-Cache: MISS, HIT
#> <- X-Cache-Hits: 0, 2
#> <- X-Timer: S1695388567.044209,VS0,VE0
#> <- Vary: Accept-Encoding
#> <-
#> <httr2_response>
#> GET https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio
#> Status: 200 OK
#> Content-Type: text/html
#> Body: In memory (60149 bytes)
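To compare several User-Agent strings without the 403 aborting the script, the error conversion can be switched off and only the status code returned. This is a minimal sketch: `ua_status` is a hypothetical helper name, not part of httr2, and it assumes httr2 is installed.

```r
# Hypothetical helper: return the HTTP status code a URL answers with
# for a given User-Agent string, without raising an error on 4xx/5xx.
ua_status <- function(url, ua) {
  resp <- httr2::request(url) |>
    httr2::req_user_agent(ua) |>
    # treat every response as "not an error" so req_perform() returns it
    httr2::req_error(is_error = \(resp) FALSE) |>
    httr2::req_perform()
  httr2::resp_status(resp)
}
```

Calling `ua_status(url_, "RStudio Desktop")` and `ua_status(url_, "RStudio.Desktop")` should then reproduce the 403 versus 200 difference shown above, provided the site still filters on that string.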
When switching to rvest::session(), things magically work, as the default User-Agent for session() requests is different:
libcurl/7.84.0 r-curl/5.0.2 httr/1.4.7
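A minimal sketch of the session() route, assuming rvest is installed. `fetch_cells` is a hypothetical helper name introduced here for illustration only. It relies on html_elements() accepting a session object directly.

```r
# Hypothetical helper (not part of rvest): request a page through
# rvest::session(), which sends httr's default User-Agent, and return
# the text of all nodes matching a CSS selector.
fetch_cells <- function(url, css) {
  s <- rvest::session(url)
  nodes <- rvest::html_elements(s, css)
  rvest::html_text(nodes)
}
```

With the URL and selector from the question, `fetch_cells("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio", "thead+ thead th , #style-1 td")` would return the same character vector the original code stored in `results`, as long as the page layout has not changed.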
So either use session() instead of read_html(), as suggested in the comments, or use some other means to request the page content with a different User-Agent; you can still parse the response with rvest. Here's an example with httr2:
library(rvest)
library(httr2)
# make a request with httr2 ..
request("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio") |>
req_user_agent("libcurl") |>
req_perform() |>
resp_body_html() |>
#... and parse with rvest:
html_elements("thead+ thead th , #style-1 td")
#> {xml_nodeset (232)}
#> [1] <th style="text-align:center;">Date</th>
#> [2] <th style="text-align:center;">Stock Price</th>
#> [3] <th style="text-align:center;">TTM Net EPS</th>
#> [4] <th style="text-align:center;">PE Ratio</th>
#> [5] <td style="text-align:center;">2023-09-21</td>
#> [6] <td style="text-align:center;">2530.76</td>
#> [7] <td style="text-align:center;"></td>
#> [8] <td style="text-align:center;">29.36</td>
#> [9] <td style="text-align:center;">2023-08-31</td>
#> [10] <td style="text-align:center;">2531.33</td>
#> [11] <td style="text-align:center;">$86.21</td>
#> [12] <td style="text-align:center;">29.36</td>
#> [13] <td style="text-align:center;">2023-05-31</td>
#> [14] <td style="text-align:center;">2386.84</td>
#> [15] <td style="text-align:center;">$126.72</td>
#> [16] <td style="text-align:center;">18.84</td>
#> [17] <td style="text-align:center;">2023-02-28</td>
#> [18] <td style="text-align:center;">2486.54</td>
#> [19] <td style="text-align:center;">$121.63</td>
#> [20] <td style="text-align:center;">20.44</td>
#> ...
Created on 2023-09-22 with reprex v2.0.2
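Once the nodes are available, the data frame from the question can be rebuilt without hard-coded seq() offsets by reshaping the cell text into a four-column matrix. The sketch below runs offline on the first twelve cell values copied from the output above; with live data, `cells` (an assumed variable name) would instead hold the result of html_text() on the node set.

```r
# Cell text of the first 12 nodes, copied from the output above
cells <- c("Date", "Stock Price", "TTM Net EPS", "PE Ratio",
           "2023-09-21", "2530.76", "", "29.36",
           "2023-08-31", "2531.33", "$86.21", "29.36")

header <- cells[1:4]
# Remaining cells are row-major: one table row per 4 consecutive values
body <- matrix(cells[-(1:4)], ncol = 4, byrow = TRUE)

pe <- data.frame(
  as.Date(body[, 1]),
  as.numeric(body[, 2]),
  # drop the dollar sign; an empty cell becomes NA
  as.numeric(sub("$", "", body[, 3], fixed = TRUE)),
  as.numeric(body[, 4])
)
# "Stock Price" -> "Stock.Price" etc., matching the question's column names
names(pe) <- make.names(header)
pe
```

Converting the columns at this point also turns the empty TTM Net EPS cell of the most recent row into NA instead of an empty string.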
What makes it a bit trickier to debug is that the execution method plays a role here: OP's code fails when it is run in an interactive RStudio session, but when executed through reprex it works, as the offending string is not included in the User-Agent.
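A quick way to check which environment you are in is to look at the User-Agent string R advertises. This sketch assumes the header is derived from the base R option `HTTPUserAgent`, which matches the format of the string quoted earlier but is not confirmed here for every request path.

```r
# The User-Agent string R advertises by default in this session
ua <- getOption("HTTPUserAgent")
ua

# TRUE if the string contains the part the site appears to reject
flagged <- grepl("RStudio Desktop", ua, fixed = TRUE)
flagged
```

Under that assumption, `flagged` would be TRUE in an interactive RStudio Desktop session and FALSE when the same code runs through reprex or plain Rscript.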
Just for reference, this is how rvest::read_html() fails with 403 in RStudio:
library(rvest)
read_html("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
#> Error in open.connection(x, "rb") : HTTP error 403.
sessioninfo::platform_info()
#> setting value
#> version R version 4.2.3 (2023-03-15 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Estonian_Estonia.utf8
#> ctype Estonian_Estonia.utf8
#> tz Europe/Helsinki
#> date 2023-09-22
#> pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
Created on 2023-09-22 with reprex v2.0.2