First question on Stack Overflow. I am trying to scrape fxstreet.com/news. Their news feed appears to produce articles dynamically, so BeautifulSoup cannot pick them up, and I have switched to Selenium. However, I am having trouble getting Selenium to access the articles that are displayed.
import requests
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0')
articles = driver.find_elements_by_link_text('/news')
for post in articles:
    print(post.text)
I would like to make a scraper that checks periodically for new articles; each article has a URL of the form https://www.fxstreet.com/news...(endpoint).
However, when I look up hrefs/'a' tags, I get many links from around the site, but none of them are the news articles featured in the live feed. When I look up every single 'div', I get the whole HTML laid out for me:
<article class="fxs_entriesList_article_with_image ">
<h3 class="fxs_entryHeadline">
<a href="https://www.fxstreet.com/news/gbp-usd-upside-potential-limited-in-covid-19-uncertainties-202004021808" title="GBP/USD upside potential limited in COVID-19 uncertainties">GBP/USD upside potential limited in COVID-19 uncertainties</a>
</h3>
<address class="fxs_entry_metaInfo">
<span class="fxs_article_author">
By <a href="/author/ross-j-burland" rel="nofollow">Ross J Burland</a>
</span> | <time pubdate="" datetime="">18:08 GMT</time>
</address>
</article>
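Just to show that this markup is easy to parse once it actually appears in the page source, here is a minimal sketch using only the standard library's html.parser (the HeadlineLinkParser class is my own, not from any library):

```python
from html.parser import HTMLParser

class HeadlineLinkParser(HTMLParser):
    """Collect hrefs of <a> tags inside <h3 class="fxs_entryHeadline"> elements."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and "fxs_entryHeadline" in attrs.get("class", ""):
            self.in_headline = True
        elif tag == "a" and self.in_headline:
            self.links.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_headline = False

snippet = '''<article class="fxs_entriesList_article_with_image ">
<h3 class="fxs_entryHeadline">
<a href="https://www.fxstreet.com/news/gbp-usd-upside-potential-limited-in-covid-19-uncertainties-202004021808" title="GBP/USD upside potential limited in COVID-19 uncertainties">GBP/USD upside potential limited in COVID-19 uncertainties</a>
</h3>
</article>'''

parser = HeadlineLinkParser()
parser.feed(snippet)
print(parser.links)
```

So the parsing itself is not the problem; the snippet above just never shows up in what Selenium hands back to me.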
telling me that the markup exists somewhere, somehow, but I am completely unable to interact with it. So how do I access the links I need when Selenium cannot find the 'a' tags or partial links? I have also tried looking for the exact link using:
elem = driver.find_elements_by_partial_link_text("news")
for element in elem:
    print(element.get_attribute("innerHTML"))
To no avail. I have also tried adding explicit and implicit waits. Thanks.
Use the CSS selector below to get all of the news-related links.
h4.fxs_headline_tiny a
Additional imports needed for the explicit wait:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Your code should look like the below:
url = "https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0"
driver.get(url)
# Wait up to 120 seconds for at least one headline link to be present.
WebDriverWait(driver, 120).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h4.fxs_headline_tiny a")))
news_elems = driver.find_elements_by_css_selector("h4.fxs_headline_tiny a")
for ele in news_elems:
    print(ele.get_attribute('href'))
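To extend this into the periodic checker you mentioned, one approach is to keep a set of URLs already seen and report only the difference on each poll. A minimal sketch of that bookkeeping, independent of Selenium (fetch_links is a hypothetical stand-in for the find_elements_by_css_selector loop above):

```python
import time

def diff_new(seen, current):
    """Return the links in `current` that are not yet in `seen`, order preserved."""
    return [url for url in current if url not in seen]

def poll(fetch_links, interval=60, rounds=3):
    """Call fetch_links() every `interval` seconds and print links not seen before."""
    seen = set()
    for _ in range(rounds):
        new = diff_new(seen, fetch_links())
        for url in new:
            print("new article:", url)
        seen.update(new)
        time.sleep(interval)
```

In practice you would pass a function that re-runs the CSS-selector lookup (and probably re-loads the page) as fetch_links.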