web-scraping xpath css-selectors rselenium

RSelenium To Scrape Yahoo Financial News Headlines


I would like to get news headlines for a company from Yahoo Finance. I use RSelenium to start a remote browser and accept cookies. I found the surrounding CSS class "StretchedBox", and I can see the headline when I inspect the page in the browser. How can I store these headlines? Next, I would like to scroll down with RSelenium and save more of these elements (say, covering several days).

library('RSelenium')

# Start Remote Browser
rD <- rsDriver(port = 4840L, browser = c("firefox")) 
remDr <- rD[["client"]]

# Navigate to Yahoo Finance News for Specific Company
# This takes an unusually long time
remDr$navigate("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")

# Get "accept all cookies" button
webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'btn primary')]") 

# We can check if we did get the proper button by checking the text of the element:
unlist(lapply(webElems, function(x) {x$getElementText()}))

# We found two buttons, and we want to click the first one:
webElems[[1]]$clickElement()

# wait for page loading
Sys.sleep(5) 

# I am looking for the news headlines in or after the StretchedBox elements
boxes <- remDr$findElements(using = "class", "StretchedBox")
boxes[1] # empty

boxes[[1]]$browserName  



Solution

  • Finally, I found an XPath that matches the headline links, from which getElementText() returns the news article headlines.

    library('RSelenium')
    
    # Start Browser
    rD <- rsDriver(port = 4835L, browser = c("firefox")) 
    remDr <- rD[["client"]]
    
    # Navigate to Yahoo Financial News
    remDr$navigate("https://finance.yahoo.com/quote/AAPL/news?p=AAPL")
    
    # Click Accept Cookies
    webElems <- remDr$findElements(using = "xpath", "//button[starts-with(@class, 'btn primary')]")
    unlist(lapply(webElems, function(x) {x$getElementText()}))
    webElems[[1]]$clickElement()
    
    # extract headlines from html/css by xpath 
    headlines <- remDr$findElements(using = "xpath", "//h3[@class = 'Mb(5px)']//a")
    
    # extract headline text
    headlines <- sapply(headlines, function(x){x$getElementText()})
    headlines[1]
    
    [[1]]
    [1] "What Kind Of Investors Own Most Of Apple Inc. (NASDAQ:AAPL)?"
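
The question also asks how to scroll down and save more headlines, which the code above does not cover. The following is only a sketch of one possible approach, assuming the page loads more articles when scrolled to the bottom and that the XPath from the answer still matches the headline links (Yahoo's markup may have changed). The function names clean_headlines and scroll_and_collect are hypothetical helpers, not part of RSelenium, and remDr is assumed to be the client created above.

```r
# Hypothetical helper: flatten the list returned by getElementText() calls
# into a character vector, dropping empty strings and duplicates.
clean_headlines <- function(texts) {
  out <- trimws(unlist(texts))
  unique(out[nzchar(out)])
}

# Hypothetical helper: scroll to the bottom of the page a fixed number of
# times, pausing so new articles can load, then collect the headline text.
# Assumes `remDr` is an open RSelenium client already on the news page.
scroll_and_collect <- function(remDr, xpath, n_scrolls = 5, pause = 2) {
  for (i in seq_len(n_scrolls)) {
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    Sys.sleep(pause)  # crude wait; the right pause depends on the connection
  }
  elems <- remDr$findElements(using = "xpath", xpath)
  clean_headlines(lapply(elems, function(x) x$getElementText()))
}
```

With the browser session from the answer still open, a call such as scroll_and_collect(remDr, "//h3[@class = 'Mb(5px)']//a") would return a character vector of headlines. How many days this covers depends on how much the page loads per scroll, so n_scrolls and pause would need to be tuned by hand.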