R Code: Scrape ETF summary stats from Yahoo Finance


I want to scrape ETF summary stats from Yahoo Finance. For example, the page is https://finance.yahoo.com/quote/IVV. The table to scrape sits below the chart, and the key fields are NAV, PE Ratio (TTM), Yield, Beta and Expense Ratio. I previously used the rvest package as follows, but it no longer works because the page structure has changed:

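library(rvest)
library(purrr)
library(dplyr)

ticker <- "IVV"
url <- paste0("https://finance.yahoo.com/quote/", ticker)
# Worked while the stats were in <table> elements; now returns nothing useful
df <- url %>%
      read_html() %>%
      html_table() %>%
      map_df(bind_cols) %>%
      as_tibble()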

Any help appreciated


Solution

  • It looks like there is no longer a table element on that page; the info you're after is now contained in list elements. I have tweaked the code to capture the label and value from each list element.

    library(rvest)
    library(purrr)
    library(dplyr)
    
    ticker <- "IVV"
    url <- paste0("https://finance.yahoo.com/quote/",ticker)
    
    ivv_html <- read_html(url)
    
    node_txt <- ".svelte-tx3nkj" # This contains "table" info of interest
    
    df <- ivv_html %>% 
      html_nodes(paste0(".container", node_txt)) %>%
      map_dfr(~{
        tibble(
          label = html_nodes(.x, paste0(".label", node_txt)) %>% 
            html_text(trim = TRUE)
          ,value = html_nodes(.x, paste0(".value", node_txt)) %>% 
            html_text(trim = TRUE)
        )
      })
    
    df %>% 
      filter(label %in% c("NAV", "PE Ratio (TTM)", "Yield", "Beta (5Y Monthly)", "Expense Ratio (net)"))
    
    # A tibble: 5 × 2
      label               value 
      <chr>               <chr> 
    1 NAV                 519.85
    2 PE Ratio (TTM)      26.22 
    3 Yield               1.37% 
    4 Beta (5Y Monthly)   1.00  
    5 Expense Ratio (net) 0.03% 
    

    Adding the .container class limits the results to just the "table" located under the chart; otherwise every element tagged with the class .svelte-tx3nkj on that page will be extracted.
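
To illustrate the scoping point, here is a minimal offline sketch. The HTML fragment is hypothetical (it is not Yahoo's actual markup) and only assumes that the same class suffix appears both inside and outside the container.

```r
library(rvest)

# Hypothetical markup: the class suffix is shared by elements
# inside and outside the container we care about.
page <- minimal_html('
  <div class="container svelte-tx3nkj">
    <span class="label svelte-tx3nkj">NAV</span>
    <span class="value svelte-tx3nkj">519.85</span>
  </div>
  <div class="other svelte-tx3nkj">
    <span class="label svelte-tx3nkj">Unrelated</span>
  </div>
')

# Unscoped: matches every label on the page
all_labels <- page %>%
  html_nodes(".label.svelte-tx3nkj") %>%
  html_text(trim = TRUE)

# Scoped: matches only labels nested under the container
scoped_labels <- page %>%
  html_nodes(".container.svelte-tx3nkj") %>%
  html_nodes(".label.svelte-tx3nkj") %>%
  html_text(trim = TRUE)
```

Here `all_labels` holds both labels, while `scoped_labels` holds only the one inside the container.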


    UPD 2024-08-23, following HTML structure change:

    node_txt <- "yf-tx3nkj"
    
    # Assign the result so that the filter step below has a df to work on
    df <- ivv_html %>% 
      html_nodes(paste0("ul.", node_txt)) %>% 
      html_nodes(paste0(".", node_txt)) %>% 
      map(~{
        tibble(
          label = html_nodes(.x, paste0(".label.", node_txt)) %>% 
            html_text(trim = TRUE)
          ,value = html_nodes(.x, paste0(".value.", node_txt)) %>%
            html_text(trim = TRUE)
        )
      }) %>% 
      list_rbind() # requires purrr >= 1.0.0
    
    df %>% 
      filter(label %in% c("NAV", "PE Ratio (TTM)", "Yield", "Beta (5Y Monthly)", "Expense Ratio (net)"))
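
Since the class suffix has already changed once, it may help to keep it as a parameter. The sketch below wraps the same selection logic in a helper; the name `extract_quote_stats` and the mock HTML are assumptions made for illustration, and the real page markup may differ.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical helper: collect label/value pairs from a parsed page.
# `node_txt` is the class suffix, which may change again in the future.
extract_quote_stats <- function(doc, node_txt = "yf-tx3nkj") {
  doc %>%
    html_nodes(paste0("ul.", node_txt)) %>%
    html_nodes(paste0(".", node_txt)) %>%
    map(~ tibble(
      label = html_text(html_nodes(.x, paste0(".label.", node_txt)), trim = TRUE),
      value = html_text(html_nodes(.x, paste0(".value.", node_txt)), trim = TRUE)
    )) %>%
    list_rbind() # requires purrr >= 1.0.0
}

# Mock document standing in for the live page (assumed structure)
mock_doc <- minimal_html('
  <ul class="yf-tx3nkj">
    <li class="yf-tx3nkj">
      <span class="label yf-tx3nkj">NAV</span>
      <span class="value yf-tx3nkj">519.85</span>
    </li>
    <li class="yf-tx3nkj">
      <span class="label yf-tx3nkj">Yield</span>
      <span class="value yf-tx3nkj">1.37%</span>
    </li>
  </ul>
')

stats <- extract_quote_stats(mock_doc)
```

Against the live page, the call would be `extract_quote_stats(read_html(url))`, passing a different `node_txt` if the suffix changes again.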