I am trying to create a dataset for my chatbot to learn from by using Selenium to scrape data from a website. The articles I am trying to open all share the same class, so I have to figure out how to cycle through all of them.
I was able to open the first link and scrape its data, but I don't know how to click the second link, then the third, and so on.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
import bs4
PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not treated as escapes
driver = webdriver.Chrome(PATH)
driver.get("https://www.hybrid.cz/tagy/tesla")
link = driver.find_element_by_class_name("nodeTitle")
link.click()
url = driver.current_url
print(url)
r = requests.get(url, allow_redirects=True)
#print(r.text)
soup = bs4.BeautifulSoup(r.text, 'lxml')
for paragraphs in soup.find_all("div", {"class": "node"}):
    ##print(paragraphs)
    with open('test.txt', 'a', encoding='utf-8') as file:
        #print(paragraphs)
        file.write(str(paragraphs))
time.sleep(5)
driver.back()
link2 = driver.find_element_by_xpath("//div[@class='nodeTitle']")
print(link2)
#link2.click()
Right now I am just trying to get the link to print so I know there's something to click, but I have not been able to do that. I would be grateful for any help.
Thank you very much.
You can use the CSS selector h2 a
to grab all links to the articles on the page. Store these links in a list, then use that list to fetch the articles. For example:
import requests
from bs4 import BeautifulSoup

def read_article(url):
    rv = []
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for paragraph in soup.find_all("div", {"class": "node"}):
        rv.append(paragraph.get_text(strip=True, separator=' '))
    return rv

url = 'https://www.hybrid.cz/tagy/tesla'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# 1. read all urls:
all_urls = []
for link in soup.select('h2 a'):
    print(link.text)
    print('https://www.hybrid.cz' + link['href'])
    print('-' * 80)
    all_urls.append('https://www.hybrid.cz' + link['href'])

# 2. print all articles from grabbed urls:
for url in all_urls:
    print(read_article(url))
Prints:
Tesla opět zdražuje, má k tomu ale pádný důvod: robotické řízení je už v betatestu neuvěřitelné!
https://www.hybrid.cz/tesla-opet-zdrazuje-ma-k-tomu-ale-padny-duvod-roboticke-rizeni-je-uz-v-betatestu-neuveritelne
--------------------------------------------------------------------------------
Tesla zveřejnila finanční výsledky a vytřela analytikům zrak, navýšila příjmy i zisk
https://www.hybrid.cz/tesla-financni-vysledky-vytrela-analytikum-zrak-navysila-prijmy-i-zisk
--------------------------------------------------------------------------------
Tesla rozjíždí betatest robotického řízení, nástup bude extrémně opatrný
https://www.hybrid.cz/tesla-rozjizdi-betatest-robotickeho-rizeni-nastup-bude-exteremne-opatrny
--------------------------------------------------------------------------------
Nové Tesla baterie jsou ještě lepší, než se čekalo: životnost přes 3,5 mil. km!
https://www.hybrid.cz/nove-tesla-baterie-jsou-jeste-lepsi-nez-se-cekalo-zivotnost-pres-35-mil-km
--------------------------------------------------------------------------------
...and so on.
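One robustness note: 'https://www.hybrid.cz' + link['href'] assumes every href on the page is site-relative. The standard library's urljoin handles relative and absolute hrefs alike (a minimal sketch; the example paths here are made up, not taken from the site):

```python
from urllib.parse import urljoin

base = 'https://www.hybrid.cz/tagy/tesla'

# a site-relative href is resolved against the base domain...
print(urljoin(base, '/some-article'))          # https://www.hybrid.cz/some-article

# ...while an already-absolute href is left untouched
print(urljoin(base, 'https://example.com/x'))  # https://example.com/x
```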
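Also note that the question's loop writes str(paragraphs), i.e. the raw HTML including tags, into test.txt. For chatbot training data you probably want the plain text that get_text returns, one article per line. A sketch on a small inline snippet (the real page would come from requests.get; the filename test.txt is taken from the question):

```python
from bs4 import BeautifulSoup

# a tiny stand-in for one article page
html = '<div class="node"><p>Tesla news.</p><p>More text.</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# join all "node" divs into one plain-text line, as read_article does
text = ' '.join(
    div.get_text(strip=True, separator=' ')
    for div in soup.find_all("div", {"class": "node"})
)

# append one article per line, mirroring the question's test.txt
with open('test.txt', 'a', encoding='utf-8') as f:
    f.write(text + '\n')
```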