Tags: r, web-scraping, download, rvest

How to download data from macrotrends web site with R?


Some time ago I found this R code, edited by Greg (here), and it worked very well for a long time. Unfortunately it has since stopped working (at least for me), and I am wondering whether someone could help fix it, if that is possible.

library(rvest)
webpage <- read_html("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
html <- rvest::html_nodes(webpage, "thead+ thead th , #style-1 td")
results <- rvest::html_text(html)

Date <- results[seq(5, length(results), 4)]
`Stock Price` <- results[seq(6, length(results), 4)]
`TTM Net EPS` <- results[seq(7, length(results), 4)]
`PE Ratio` <- results[seq(8, length(results), 4)]
results <- data.frame(Date, `Stock Price`, `TTM Net EPS`, `PE Ratio`, stringsAsFactors = FALSE)

It used to return:

head(results)
        Date Stock.Price TTM.Net.EPS PE.Ratio
1 2020-04-27     1060.52                16.25
2 2020-02-29     1032.51      $65.27    15.82
3 2019-11-30     1177.92      $64.37    18.30
4 2019-08-31     1101.69      $63.54    17.34
5 2019-05-31     1027.11      $55.97    18.35
6 2019-02-28      938.97      $53.40    17.58

But as I said, sadly it no longer seems to work. If somebody could help, it would be very nice for the R world.


Solution

  • At least in R and RStudio running on Windows, the User-Agent request header sent by rvest::read_html() looks something like:

    RStudio Desktop (2023.6.2.529); R (4.2.3 x86_64-w64-mingw32 x86_64 mingw32)
    

    Apparently the site really dislikes the RStudio Desktop part. Here is a demo, using httr2 for convenience.
    Set User-Agent to something that includes RStudio Desktop and the request fails:

    library(httr2)
    url_ <- "https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio"
    
    request(url_) |>
      req_user_agent("RStudio Desktop") |>
      req_perform(verbosity = 1)
    #> -> GET /stocks/charts/azo/autozone/pe-ratio HTTP/1.1
    #> -> Host: www.macrotrends.net
    #> -> User-Agent: RStudio Desktop
    #> -> Accept: */*
    #> -> Accept-Encoding: deflate, gzip
    #> -> 
    #> <- HTTP/1.1 403 Forbidden
    #> <- Connection: close
    #> <- Content-Length: 420
    #> <- Server: Varnish
    #> <- Retry-After: 0
    #> <- Content-Type: text/html; charset=utf-8
    #> <- Accept-Ranges: bytes
    #> <- Date: Fri, 22 Sep 2023 13:16:06 GMT
    #> <- Via: 1.1 varnish
    #> <- X-Served-By: cache-hel1410024-HEL
    #> <- X-Cache: MISS
    #> <- X-Cache-Hits: 0
    #> <- X-Timer: S1695388567.783692,VS0,VE0
    #> <-
    #> Error in `req_perform()`:
    #> ! HTTP 403 Forbidden.
    #> Backtrace:
    #>     ▆
    #>  1. └─httr2::req_perform(req_user_agent(request(url_), "RStudio Desktop"), verbosity = 1)
    #>  2.   └─httr2:::resp_abort(resp, error_body(req, resp), call = error_call)
    #>  3.     └─rlang::abort(...)
    

    Change one character in User-Agent and we are worthy of a status 200 and content:

    request(url_) |>
      req_user_agent("RStudio.Desktop") |>
      req_perform(verbosity = 1)
    #> -> GET /stocks/charts/azo/autozone/pe-ratio HTTP/1.1
    #> -> Host: www.macrotrends.net
    #> -> User-Agent: RStudio.Desktop
    #> -> Accept: */*
    #> -> Accept-Encoding: deflate, gzip
    #> -> 
    #> <- HTTP/1.1 200 OK
    #> <- Connection: keep-alive
    #> <- Content-Length: 14962
    #> <- Server: Apache/2.4.18 (Ubuntu)
    #> <- Cache-Control: no-cache, no-store, must-revalidate
    #> <- Pragma: no-cache
    #> <- Expires: 0
    #> <- Content-Encoding: gzip
    #> <- Content-Type: text/html; charset=UTF-8
    #> <- Via: 1.1 varnish, 1.1 varnish
    #> <- Accept-Ranges: bytes
    #> <- Date: Fri, 22 Sep 2023 13:16:07 GMT
    #> <- Age: 1763
    #> <- X-Served-By: cache-iad-kjyo7100051-IAD, cache-hel1410025-HEL
    #> <- X-Cache: MISS, HIT
    #> <- X-Cache-Hits: 0, 2
    #> <- X-Timer: S1695388567.044209,VS0,VE0
    #> <- Vary: Accept-Encoding
    #> <-
    #> <httr2_response>
    #> GET https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio
    #> Status: 200 OK
    #> Content-Type: text/html
    #> Body: In memory (60149 bytes)
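
The two requests above can be folded into a small probe that reports the status code for several User-Agent strings at once. This is only a sketch: `ua_status()` is a hypothetical helper name, and the statuses it returns depend on whatever filtering rules the site applies at the time you run it.

```r
library(httr2)

# Hypothetical helper: return the HTTP status code a URL gives for one User-Agent.
ua_status <- function(url, user_agent) {
  request(url) |>
    req_user_agent(user_agent) |>
    # Never treat a response as an error, so a 403 is returned rather than thrown.
    req_error(is_error = \(resp) FALSE) |>
    req_perform() |>
    resp_status()
}

# Example call (needs network access; results depend on the site's current rules):
# url_ <- "https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio"
# sapply(c("RStudio Desktop", "RStudio.Desktop", "libcurl"), ua_status, url = url_)
```

Returning the status instead of raising an error makes it easier to compare several strings side by side.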
    

    When switching to rvest::session(), things work because the default User-Agent for session() requests is different:

    libcurl/7.84.0 r-curl/5.0.2 httr/1.4.7
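
For the session() route, a minimal sketch could look like the following. `fetch_cells()` is a hypothetical wrapper introduced here for illustration; the selector is the one from the question, and whether the request succeeds still depends on the site accepting the session's default User-Agent.

```r
library(rvest)

# Hypothetical helper: fetch a page through session() and return the text of
# the table cells matched by the selector from the question.
fetch_cells <- function(url, css = "thead+ thead th , #style-1 td") {
  session(url) |>
    read_html() |>
    html_elements(css) |>
    html_text()
}

# Example call (needs network access):
# results <- fetch_cells("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
```

The returned character vector has the same shape as `results` in the question, so the rest of the original code can stay as it is.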
    

    So either use session() instead of read_html(), as suggested in the comments, or request the page content by some other means with a different User-Agent; you can still parse the response with rvest. Here's an example with httr2:

    library(rvest)
    library(httr2)
    # make a request with httr2 ..
    request("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio") |>
      req_user_agent("libcurl") |>
      req_perform() |>
      resp_body_html() |>
      #... and parse with rvest:
      html_elements("thead+ thead th , #style-1 td")
    #> {xml_nodeset (232)}
    #>  [1] <th style="text-align:center;">Date</th>
    #>  [2] <th style="text-align:center;">Stock Price</th>
    #>  [3] <th style="text-align:center;">TTM Net EPS</th>
    #>  [4] <th style="text-align:center;">PE Ratio</th>
    #>  [5] <td style="text-align:center;">2023-09-21</td>
    #>  [6] <td style="text-align:center;">2530.76</td>
    #>  [7] <td style="text-align:center;"></td>
    #>  [8] <td style="text-align:center;">29.36</td>
    #>  [9] <td style="text-align:center;">2023-08-31</td>
    #> [10] <td style="text-align:center;">2531.33</td>
    #> [11] <td style="text-align:center;">$86.21</td>
    #> [12] <td style="text-align:center;">29.36</td>
    #> [13] <td style="text-align:center;">2023-05-31</td>
    #> [14] <td style="text-align:center;">2386.84</td>
    #> [15] <td style="text-align:center;">$126.72</td>
    #> [16] <td style="text-align:center;">18.84</td>
    #> [17] <td style="text-align:center;">2023-02-28</td>
    #> [18] <td style="text-align:center;">2486.54</td>
    #> [19] <td style="text-align:center;">$121.63</td>
    #> [20] <td style="text-align:center;">20.44</td>
    #> ...
    

    Created on 2023-09-22 with reprex v2.0.2
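
To get from the cell texts back to the data frame the question was after, one possible sketch is to reshape the flat vector row by row rather than index it with seq(). `cells_to_df()` is a hypothetical helper, and the sample values are copied from the output above; in practice the vector would come from html_text() on the selected nodes.

```r
# Hypothetical helper: reshape a flat vector of header + data cells into a data frame.
cells_to_df <- function(cells, n_col = 4) {
  stopifnot(length(cells) %% n_col == 0)
  m <- matrix(cells, ncol = n_col, byrow = TRUE)
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]  # first row holds the column headers
  df
}

# Sample cell texts taken from the nodeset printed above.
cells <- c("Date", "Stock Price", "TTM Net EPS", "PE Ratio",
           "2023-09-21", "2530.76", "", "29.36",
           "2023-08-31", "2531.33", "$86.21", "29.36",
           "2023-05-31", "2386.84", "$126.72", "18.84")

df <- cells_to_df(cells)

# Convert text columns to proper types; the empty EPS cell becomes NA.
df$Date <- as.Date(df$Date)
df[["Stock Price"]] <- as.numeric(df[["Stock Price"]])
df[["TTM Net EPS"]] <- as.numeric(sub("$", "", df[["TTM Net EPS"]], fixed = TRUE))
df[["PE Ratio"]] <- as.numeric(df[["PE Ratio"]])
```

Reshaping by row keeps the column names from the page itself and fails early if the number of cells is not a multiple of four.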

    What makes this a bit trickier to debug is that the execution method plays a role: OP's code fails when run in an interactive RStudio session, but when executed through reprex it works, because the offending string is not included in the User-Agent.

    Just for reference, this is how rvest::read_html() fails with a 403 in RStudio:

    library(rvest)
    read_html("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
    #> Error in open.connection(x, "rb") : HTTP error 403.
    
    sessioninfo::platform_info()
    #>  setting  value
    #>  version  R version 4.2.3 (2023-03-15 ucrt)
    #>  os       Windows 10 x64 (build 19045)
    #>  system   x86_64, mingw32
    #>  ui       RTerm
    #>  language (EN)
    #>  collate  Estonian_Estonia.utf8
    #>  ctype    Estonian_Estonia.utf8
    #>  tz       Europe/Helsinki
    #>  date     2023-09-22
    #>  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
    

    Created on 2023-09-22 with reprex v2.0.2
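
If you would rather keep read_html(), another possibility is to change the User-Agent temporarily through R's HTTPUserAgent option. This rests on an assumption that is not verified in the answer above: that read_html() takes its User-Agent from that option, which the format of the RStudio string suggests. `with_user_agent()` is a hypothetical helper name.

```r
# Hypothetical helper: evaluate `code` with a temporary HTTPUserAgent option,
# restoring the previous value afterwards.
# Assumption: read_html() picks up its User-Agent from this option.
with_user_agent <- function(ua, code) {
  old <- options(HTTPUserAgent = ua)
  on.exit(options(old), add = TRUE)
  force(code)  # the lazy argument is evaluated here, while the option is set
}

# Example call (needs network access; whether it avoids the 403 depends on the site):
# page <- with_user_agent(
#   "libcurl",
#   rvest::read_html("https://www.macrotrends.net/stocks/charts/azo/autozone/pe-ratio")
# )
```

Restoring the option on exit keeps the change from leaking into other requests in the same session.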