
Web Scraping returns empty in R


I'm trying to scrape prices from Bloomberg. I can get the current price as shown below, but I can't get the previous price. What's wrong?

library(rvest)

url <- "https://www.bloomberg.com/quote/WORLD:IND"

price <- read_html(url) %>% 
  html_nodes("div.overviewRow__66339412a5 span.priceText__06f600fa3e") %>% 
  html_text()

prevprice <- read_html(url) %>% 
  html_nodes("div.value__7e29a7c90d") %>% 
  html_text() # returns an empty result

prevprice <- read_html(url) %>% 
  html_nodes(xpath = '//section') %>%
  html_text() %>% 
  as.data.frame() #didn't find the price

Thanks in advance.


Solution

  • So, there are at least two initial options:

    1. Extract the value from the script tag that holds it. When a browser runs the page's JavaScript, this data is used to populate the page as you see it. rvest and httr do not run JavaScript, so you need to read the value from the script tag rather than from where it ends up in the rendered page.
    2. Or calculate the previous price from the current price and the percentage change. Because the displayed percentage is rounded, the result may be off by a very small margin.

    I show both of the above options in the code below.
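
    Option 1 can be tried offline first. The sketch below runs the same kind of pattern against a made-up, URL-encoded sample string; the `payload` value and its numbers are invented for illustration and are not taken from Bloomberg's page. Decoding the sample shows why the pattern contains `%22`, `%3A` and `%2C`.

    ```r
    library(stringr)

    # Made-up sample of URL-encoded JSON as it might sit inside a <script> tag
    # (%22 is a double quote, %3A a colon, %2C a comma); all values are invented
    payload <- paste0(
      "%22price%22%3A3412.5%2C",
      "%22previousClosingPriceOneTradingDayAgo%22%3A3398.17%2C",
      "%22currency%22%3A%22USD%22"
    )

    # Decoding reveals the plain key/value pairs behind the encoded text
    decoded <- utils::URLdecode(payload)

    # Capture the number between the encoded colon and the encoded comma;
    # column 2 of the match matrix holds the first capture group
    m <- str_match(payload, "previousClosingPriceOneTradingDayAgo%22%3A(\\d+(?:\\.\\d+)?)%2C")
    prev_price <- as.numeric(m[, 2])

    print(decoded)
    print(prev_price)
    ```

    If the key is missing from the text, `str_match()` returns `NA` in every column, so `prev_price` becomes `NA` rather than raising an error.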

    I've also changed the CSS selectors to attribute selectors with the starts-with operator (^=). This makes the code more robust, because the classes in the HTML appear to be dynamic, with only the start of each class attribute value being stable.
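
    To see why the starts-with selector helps, here is a minimal offline sketch. The two HTML snippets are hypothetical: they imitate the same element from two builds of a page, where only the hashed class suffix differs (the second suffix is invented).

    ```r
    library(rvest)

    # Hypothetical markup: the class suffixes stand in for build-generated hashes
    html_a <- minimal_html('<div><span class="priceText__06f600fa3e">3,412.50</span></div>')
    html_b <- minimal_html('<div><span class="priceText__1a2b3c4d5e">3,412.50</span></div>')

    exact_selector  <- "span.priceText__06f600fa3e"
    prefix_selector <- "[class^=priceText]"

    # Number of matches per snippet for each selector
    found_exact <- c(
      length(html_elements(html_a, exact_selector)),
      length(html_elements(html_b, exact_selector))
    )
    found_prefix <- c(
      length(html_elements(html_a, prefix_selector)),
      length(html_elements(html_b, prefix_selector))
    )

    # The prefix selector still finds the element after the suffix changes
    price <- html_b |>
      html_element(prefix_selector) |>
      html_text() |>
      gsub(pattern = ",", replacement = "", fixed = TRUE) |>
      as.numeric()

    print(found_exact)
    print(found_prefix)
    print(price)
    ```

    Note that `^=` compares against the start of the whole class attribute, so it only matches when the stable name is the first class listed on the element.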


    library(httr2)
    library(tidyverse)
    library(rvest)
    
    url <- "https://www.bloomberg.com/quote/WORLDT:IND"
    headers <- c("user-agent" = "mozilla/5.0")
    
    page <- request(url) |>
      (\(x) req_headers(x, !!!headers))() |>
      req_perform() |>
      resp_body_html()
    
    # extract direct
    prev_price <- page |>
      html_text() |>
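      stringr::str_match("previousClosingPriceOneTradingDayAgo%22%3A(\\d+(?:\\.\\d+)?)%2C") |>
      (\(m) as.numeric(m[, 2]))()  # `|> .[, 2]` is not valid with the native pipe; column 2 is the capture group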
    curr_price <- page |>
      html_element("[class^=priceText]") |>
      html_text() |>
      str_replace_all(",", "") |>
      as.numeric()
    
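    # calculate from the current price and the percentage change
    change <- page |>
      html_element("[class^=changePercent]") |>
      html_text() |>
      str_extract("[+-]?[\\d.]+") |>  # keep a leading sign if the page shows one
      as.numeric()
    # curr = prev * (1 + change / 100), so divide to recover prev
    prev_price_calculated <- curr_price / (1 + change / 100)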
    
    print(curr_price)
    print(change)
    print(prev_price)
    print(prev_price_calculated)
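
    The arithmetic behind the calculated option can be checked without touching the site. This is a small sketch with a hypothetical helper, `prev_from_change()`, and invented numbers; it assumes the percentage change is signed and expressed relative to the previous price.

    ```r
    # Hypothetical helper: recover the previous value from the current value and
    # a signed percentage change (e.g. 1.5 for +1.5%, -0.8 for -0.8%)
    prev_from_change <- function(curr, pct_change) {
      curr / (1 + pct_change / 100)
    }

    # Invented numbers: a previous close of 100 that rose 25% or fell 50%
    up   <- prev_from_change(125, 25)
    down <- prev_from_change(50, -50)

    # Shortcut that multiplies by (100 - change) / 100 instead of dividing
    approx_up <- 125 * (100 - 25) / 100

    print(c(up, down, approx_up))
    ```

    Both exact results come back to 100, while the shortcut gives 93.75 for the 25% rise. For the fractions of a percent that an index usually moves in a day the two stay close, but they are not the same formula, and dropping the sign of the change would push a falling price in the wrong direction.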