
Using Selenium to open multiple articles with the same class and scrape data from them


I am trying to create a dataset for my chatbot to learn from by using Selenium to scrape data from a website. The articles I want to open all share the same class, so I need to figure out how to cycle through them.

I was able to figure out how to open the first link and scrape its data, but I don't know how to click the second link, then the third, and so on.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
import bs4


PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so backslashes aren't treated as escapes
driver = webdriver.Chrome(PATH)

driver.get("https://www.hybrid.cz/tagy/tesla")

link = driver.find_element_by_class_name("nodeTitle")
link.click()
url = driver.current_url
print(url)
r = requests.get(url, allow_redirects=True)
#print(r.text)
soup = bs4.BeautifulSoup(r.text, 'lxml')
for paragraphs in soup.find_all("div", {"class":"node"}):
    ##print(paragraphs)
    with open('test.txt', 'a', encoding='utf-8') as file:
        #print(paragraphs)
        file.write(str(paragraphs))
time.sleep(5)
driver.back()
link2 = driver.find_element_by_xpath("//div[@class='nodeTitle']")
print(link2)
#link2.click()

Right now I am just trying to get the link to print so I know there's something to click, but I have not been able to do that. I would be grateful for any help.

Thank you very much


Solution

  • You can use the CSS selector h2 a to grab all links to articles from the page. Store these links in a list, then iterate over the list to fetch each article. For example:

    import requests
    from bs4 import BeautifulSoup
    
    
    def read_article(url):
        rv = []
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        for paragraph in soup.find_all("div", {"class":"node"}):
            rv.append(paragraph.get_text(strip=True, separator=' '))
        return rv
    
    url = 'https://www.hybrid.cz/tagy/tesla'
    
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    # 1. read all urls:
    all_urls = []
    for link in soup.select('h2 a'):
        print(link.text)
        print('https://www.hybrid.cz' + link['href'])
        print('-' * 80)
    
        all_urls.append('https://www.hybrid.cz' + link['href'])
    
    # 2. print all articles from grabbed urls:
    for url in all_urls:
        print(read_article(url))
    

    Prints:

    Tesla opět zdražuje, má k tomu ale pádný důvod: robotické řízení je už v betatestu neuvěřitelné!
    https://www.hybrid.cz/tesla-opet-zdrazuje-ma-k-tomu-ale-padny-duvod-roboticke-rizeni-je-uz-v-betatestu-neuveritelne
    --------------------------------------------------------------------------------
    Tesla zveřejnila finanční výsledky a vytřela analytikům zrak, navýšila příjmy i zisk
    https://www.hybrid.cz/tesla-financni-vysledky-vytrela-analytikum-zrak-navysila-prijmy-i-zisk
    --------------------------------------------------------------------------------
    Tesla rozjíždí betatest robotického řízení, nástup bude extrémně opatrný
    https://www.hybrid.cz/tesla-rozjizdi-betatest-robotickeho-rizeni-nastup-bude-exteremne-opatrny
    --------------------------------------------------------------------------------
    Nové Tesla baterie jsou ještě lepší, než se čekalo: životnost přes 3,5 mil. km!
    https://www.hybrid.cz/nove-tesla-baterie-jsou-jeste-lepsi-nez-se-cekalo-zivotnost-pres-35-mil-km
    --------------------------------------------------------------------------------
    
    ...and so on.
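
    To see why the h2 a selector does the work of "cycling through" the links, here is a minimal, self-contained sketch of the same idea run against sample HTML. The markup below only mimics the listing page's structure (each title in an h2 wrapping an a) and is an assumption, not the site's actual source:

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical sample markup mimicking the listing page:
    # each article title is an <a> inside an <h2>.
    SAMPLE_HTML = """
    <html><body>
    <h2><a href="/first-article">First article</a></h2>
    <h2><a href="/second-article">Second article</a></h2>
    </body></html>
    """

    def collect_article_urls(html, base='https://www.hybrid.cz'):
        """Return absolute URLs for every <h2><a> link on the page."""
        soup = BeautifulSoup(html, 'html.parser')
        return [base + a['href'] for a in soup.select('h2 a')]

    print(collect_article_urls(SAMPLE_HTML))
    # ['https://www.hybrid.cz/first-article', 'https://www.hybrid.cz/second-article']
    ```

    If you do need Selenium itself (e.g. for JavaScript-rendered pages), the same pattern applies: use driver.find_elements (plural) with a selector like h2 a, collect each element's href via get_attribute('href') into a list first, and only then navigate to each URL. Clicking a link and calling driver.back() invalidates the previously found elements (a StaleElementReferenceException), which is why collecting all the URLs up front is the safer approach.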