I want to scrape stocks' financial tables for different years with R. I can obtain the table for the most recent period, which is shown by default, but I also want the data from previous years. How can I achieve this? Here is the code I use:
# Load libraries
library(tidyverse)
library(rvest)
library(readxl)
library(magrittr)
google_finance <- read_html("https://www.google.com/finance/quote/AAPL:NASDAQ?") |>
  html_node(".UulDgc") |>
  html_table()
And the result is:
> google_finance |>
+ head(5)
# A tibble: 5 × 3
`(USD)` Mar 2024infoFiscal Q…¹ `Y/Y change`
<chr> <chr> <chr>
1 "RevenueThe tot… 90.75B -4.31%
2 "Operating expe… 14.37B 5.22%
3 "Net incomeComp… 23.64B -2.17%
4 "Net profit mar… 26.04 2.20%
5 "Earnings per s… 1.53 0.66%
As you can see, this returns only the financial table for the most recent period (March 2024). What should I do to scrape the financial tables for all years?
I think you will need RSelenium for this: it launches a browser and clicks the buttons for you. Here I am using Firefox; you might need to change some of the default settings to match your own browser setup. You will also need Java (the JDK) installed.
library(RSelenium)
library(rvest)
library(glue)
# Start a remote driver using Firefox; on first run this step may also
# download the driver binaries it needs.
rd <- rsDriver(browser = "firefox", chromever = NULL)
# Assign client
remDr <- rd$client
url <- "https://www.google.com/finance/quote/AAPL:NASDAQ"
# Extract names of buttons
aapl_html <- read_html(url)
btn_names <- aapl_html %>%
  html_node(".zsnTKc") %>%
  html_attr("aria-owns") %>%
  strsplit(., split = " ") %>%
  unlist()
# Using the Remote Driver, navigate to url of interest
remDr$navigate(url)
# In a loop, find button of interest by its xpath, click and extract table
df_ls <- lapply(
  X = btn_names,
  FUN = function(x) {
    # Find the button by its id using XPath
    btn <- remDr$findElement(using = "xpath", value = glue("//*[@id='{x}']"))
    # Nifty trick to visually see which button is being clicked
    btn$highlightElement()
    # Click the button
    btn$clickElement()
    # Wait for the elements to finish loading
    Sys.sleep(1)
    # Read the HTML after each button is clicked
    rem_aapl_html <- remDr$getPageSource()[[1]]
    # Extract the table and return it
    rem_aapl_html %>%
      read_html() %>%
      html_node(".slpEwd") %>%
      html_table()
  }
)
# Close Remote Driver and server
remDr$close()
rd$server$stop()
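If you then want everything in one data frame, note that the tables in df_ls probably cannot be stacked as they are, because the second column header differs per table: it carries the period label (Mar 2024infoFiscal Q… in the question's output). Below is a minimal sketch of one way to handle that in base R. The names tidy_period_table and df_ls_demo are made up for illustration, and df_ls_demo only stands in for df_ls so the sketch runs without a browser. Its first table reuses two values from the question's output; the second table and its "Dec 2023" label are placeholders, not real figures. The sketch assumes every table has exactly three columns and that the period label comes before the text "info" in the header.

```r
# Hypothetical helper: give one scraped table fixed column names and move
# the period label from the second column header into its own column.
tidy_period_table <- function(tbl) {
  stopifnot(ncol(tbl) == 3)  # assumption: metric, value, Y/Y change
  # Assumption: the header looks like "Mar 2024info...", so drop "info" onward
  period <- sub("info.*$", "", names(tbl)[2])
  names(tbl) <- c("metric", "value", "yoy_change")
  tbl$period <- period
  tbl
}

# Stand-in for df_ls (placeholder data, see the note above)
df_ls_demo <- list(
  data.frame(
    "(USD)" = c("Revenue", "Net income"),
    "Mar 2024infoFiscal Q..." = c("90.75B", "23.64B"),
    "Y/Y change" = c("-4.31%", "-2.17%"),
    check.names = FALSE
  ),
  data.frame(
    "(USD)" = c("Revenue", "Net income"),
    "Dec 2023info..." = c("placeholder", "placeholder"),
    "Y/Y change" = c("placeholder", "placeholder"),
    check.names = FALSE
  )
)

# Tidy every table, then stack them into one long data frame
all_periods <- do.call(rbind, lapply(df_ls_demo, tidy_period_table))
print(all_periods)
```

With the real results you would pass df_ls instead of df_ls_demo. The values stay as character strings (such as "90.75B"), so converting them to numbers is a separate step.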