Tags: r, web-scraping, rvest, yahoo-finance

R: web scraping yahoo.finance after 2019 change


I have been scraping Yahoo Finance pages for a long time using code largely borrowed from other Stack Overflow answers, and it has worked well. In the last few weeks, however, Yahoo has changed its tables to be collapsible/expandable. This has broken the code, and after several days of trying I have not been able to fix it.

Here is an example of the code that others have used for years (which is then parsed and processed in different ways by different people).

library(rvest)
library(tidyverse)

# Create a URL string
myURL <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"

# Create a data frame called df to hold this income statement
df <- myURL %>% 
  read_html() %>% 
  html_table(header = TRUE) %>% 
  map_df(bind_cols) %>% 
  as_tibble()

Can anyone help?


EDIT FOR MORE CLARITY:

If you run the above then view df you get

# A tibble: 0 x 0

For an example of the expected outcome, we can try another page that Yahoo hasn't changed, such as the following:

# Create a URL string
myURL2 <-  "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"

df2 <- myURL2 %>% 
  read_html() %>% 
  html_table(header = FALSE) %>% 
  map_df(bind_cols) %>% 
  as_tibble()

If you view df2, you get a tibble of 59 observations of two variables, which is the main table on that page. It begins with:

Market Cap (intraday) 5    [value here]
Enterprise value 3         [value here]
And so on...


Solution

As mentioned in the comment above, here is an alternative that tries to deal with the different sizes of the published tables. I worked on this with help from a friend.

    library(rvest)
    library(tidyverse)
    
    url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
    
    # Download the data
    raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")
    
    number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()
    
    if(number_of_columns > 1){
      # Create an empty data frame with the required dimensions
      df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                       stringsAsFactors = FALSE)
    
      # Fill the table looping through rows
      for (i in seq_along(raw_table)) {
        # Find the row name and set it.
        df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
        # Now grab the values
        row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
        for (j in 1:(number_of_columns - 1)) {
          df[i, j+1] <- row_values[j] %>% html_text()
        }
      }
      view(df)
    }
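
To try the selector logic without depending on the live Yahoo page, here is a minimal sketch that runs on a made-up HTML fragment. The class names D(tbr), Ta(start) and Ta(end) are taken from the selectors above; the fragment itself, its labels and values, and the names html, page, cells and df_demo are hypothetical. It also assumes the markup contains no table elements, which would explain the empty tibble in the question but is not something I have confirmed for every Yahoo page. The nested loops are swapped for lapply() plus rbind(), which only works if every row has the same number of cells.

```r
library(rvest)

# Hypothetical fragment imitating a div-based layout (not real Yahoo markup).
html <- '
<div>
  <div class="D(tbr)">
    <div class="Ta(start)"><span>Breakdown</span></div>
    <div class="Ta(end)"><span>2019</span></div>
    <div class="Ta(end)"><span>2018</span></div>
  </div>
  <div class="D(tbr)">
    <div class="Ta(start)"><span>Total Revenue</span></div>
    <div class="Ta(end)"><span>100</span></div>
    <div class="Ta(end)"><span>90</span></div>
  </div>
</div>'

page <- read_html(html)

# With no <table> elements in the markup, html_table() returns an empty list.
n_tables <- length(html_table(page))

# Parentheses in class names have to be escaped in the CSS selector.
rows <- html_nodes(page, "div.D\\(tbr\\)")

# One character vector per row: the label first, then the values.
cells <- lapply(rows, function(row) {
  c(html_text(html_nodes(row, "div.Ta\\(start\\)")),
    html_text(html_nodes(row, "div.Ta\\(end\\)")))
})

# Stack the rows; assumes all rows have the same number of cells.
df_demo <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
```

Here df_demo has one row per D(tbr) div and one column per cell, all as text, so any numeric conversion still has to happen afterwards.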