I have been happily scraping Yahoo Finance pages for a long time, using code largely borrowed from other Stack Overflow answers, and it has worked great. In the last few weeks, however, Yahoo has changed its tables to be collapsible/expandable. This has broken the code, and despite several days of effort I can't fix it.
Here is an example of the code that others have used for years (which is then parsed and processed in different ways by different people).
library(rvest)
library(tidyverse)
# Create a URL string
myURL <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
# Create a data frame called df to hold the income statement
df <- myURL %>%
  read_html() %>%
  html_table(header = TRUE) %>%
  map_df(bind_cols) %>%
  as_tibble()
Can anyone help?
EDIT FOR MORE CLARITY:
If you run the above then view df you get
# A tibble: 0 x 0
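A probable cause, assuming the redesigned page builds its rows from <div> elements rather than a <table>: html_table() only looks at <table> elements, so it returns an empty list and the pipeline ends up with an empty tibble. The snippet below is a minimal sketch of that behaviour on a made-up HTML fragment; the fragment and its values are hypothetical and are not taken from Yahoo's page.

```r
library(rvest)

# Hypothetical fragment: one "row" built from <div> elements, with no <table> tag.
div_page <- read_html(
  '<div><div class="D(tbr)"><div class="Ta(start)">Total Revenue</div><div class="Ta(end)">100</div></div></div>'
)

# html_table() finds no <table> elements, so the list it returns is empty.
tables <- html_table(div_page)
n_tables <- length(tables)

# The same rows are still reachable with a CSS selector on the div classes.
n_div_rows <- length(html_nodes(div_page, "div.D\\(tbr\\)"))
```

If n_tables is 0 while a div selector still matches rows, the data is on the page but outside anything html_table() can see.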
For an example of the expected outcome, we can try another page that Yahoo hasn't changed, such as the following:
# Create a URL string
myURL2 <- "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"
df2 <- myURL2 %>%
  read_html() %>%
  html_table(header = FALSE) %>%
  map_df(bind_cols) %>%
  as_tibble()
If you view df2, you get a tibble of 59 observations of two variables, which is the main table on that page, beginning with
Market Cap (intraday) 5    [value here]
Enterprise value 3         [value here]
And so on...
As mentioned in the comment above, here is an alternative that tries to deal with the different table sizes published. I have worked on this and have had help from a friend.
library(rvest)
library(tidyverse)
url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"
# Download the data
raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")
number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()
if (number_of_columns > 1) {
  # Create an empty data frame with the required dimensions
  df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                   stringsAsFactors = FALSE)
  # Fill the table by looping through the rows
  for (i in seq_along(raw_table)) {
    # Find the row name and set it
    df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
    # Now grab the values
    row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
    for (j in 1:(number_of_columns - 1)) {
      df[i, j + 1] <- row_values[j] %>% html_text()
    }
  }
  # View the result (df only exists when the table was found)
  view(df)
}
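The same row-by-row idea can also be sketched without pre-allocating the data frame, which makes it easier to test offline. The example below is only an illustration: parse_div_rows() is a hypothetical helper, and sample_html is a made-up fragment that reuses the class names from the selectors above, not Yahoo's real markup or values. Rows with fewer cells than the widest row are padded with NA.

```r
library(rvest)

# Hypothetical stand-in for the page: same class names as the selectors above,
# made-up labels and values.
sample_html <- '
<div>
  <div class="D(tbr)"><div class="Ta(start)"><span>Breakdown</span></div><div class="Ta(end)"><span>col_a</span></div><div class="Ta(end)"><span>col_b</span></div></div>
  <div class="D(tbr)"><div class="Ta(start)"><span>Total Revenue</span></div><div class="Ta(end)"><span>100</span></div><div class="Ta(end)"><span>90</span></div></div>
</div>'

# Hypothetical helper: one data frame row per div.D(tbr) element.
parse_div_rows <- function(page) {
  rows <- html_nodes(page, "div.D\\(tbr\\)")
  if (length(rows) == 0) {
    return(data.frame())
  }
  cells <- lapply(rows, function(row) {
    # First cell is the row label, the remaining cells are the values
    label <- html_text(html_node(row, "div.Ta\\(start\\)"))
    values <- html_text(html_nodes(row, "div.Ta\\(end\\)"))
    c(label, values)
  })
  # Pad short rows with NA so every row has the same number of columns
  width <- max(lengths(cells))
  padded <- lapply(cells, function(x) c(x, rep(NA_character_, width - length(x))))
  as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
}

parsed <- parse_div_rows(read_html(sample_html))
```

Against the live page you would pass read_html(url) instead of the sample fragment; whether the selectors still match depends on Yahoo's current markup.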