python selenium-webdriver web-scraping rpa

New York Times news scraping using pure Python and Selenium (via rpaframework)


I'm trying to scrape New York Times search results using pure Python and Selenium (via rpaframework), but I can't get it right. I need to get the title, date, and description of each result. My code so far is shown below, after the error message.

When I print the title, I get this error:

selenium.common.exceptions.InvalidArgumentException: Message: unknown variant //h4[@class='css-2fgx4k'], expected one of css selector, link text, partial link text, tag name, xpath at line 1 column 37

from RPA.Browser.Selenium import Selenium

# Search term
search_term = "climate change"

# Open the NY Times search page and search for the term
browser = Selenium()
browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)

# Find all the search result articles
articles = browser.find_elements("//ol[@data-testid='search-results']/li")


# Extract title, date, and description for each article and add to the list
for article in articles:
    # Extract the title
    title = article.find_element("//h4[@class='css-2fgx4k']")
    print(title)


# Close the browser window
browser.close_all_browsers()

Any assistance would be appreciated.


Solution

  • In full disclosure, I'm the author of the Browserist package. Browserist is a lightweight, less verbose extension of the Selenium web driver that makes browser automation even easier. Simply install the package with pip install browserist and try this:

    from browserist import Browser
    from selenium.webdriver.common.by import By
    
    search_term = "climate"
    
    with Browser() as browser:
        browser.open.url("https://www.nytimes.com/search?query=" + search_term)
        search_result_elements = browser.get.elements("//ol[@data-testid='search-results']/li")
        for element in search_result_elements:
            try:
                title = element.find_element(By.TAG_NAME, "h4").text
                print(title)
            except Exception:
                # Skip results without an h4 title; not all result items are uniform
                pass
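
Both the question's code and the snippet above build the search URL by plain string concatenation, which leaves the space in a term such as "climate change" unencoded. A minimal sketch of percent-encoding the term with the standard library follows. The helper name build_search_url is hypothetical, and the sketch assumes the search page reads the term from the query parameter, as in the URLs above.

```python
from urllib.parse import quote

SEARCH_URL = "https://www.nytimes.com/search?query="


def build_search_url(search_term: str) -> str:
    """Return the search URL with the term percent-encoded (hypothetical helper)."""
    # safe="" encodes every reserved character: a space becomes %20, "&" becomes %26
    return SEARCH_URL + quote(search_term, safe="")


print(build_search_url("climate change"))
# prints https://www.nytimes.com/search?query=climate%20change
```

The returned string can be passed to browser.open.url(...) in place of the concatenated one.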
    

    Notes:

    • The simpler search term climate yields more results that are still relevant, e.g. climate crisis, but that's up to you to change.
    • It's easier and more robust to target the title by its h4 tag instead of the CSS class value, which might change over time.
    • As not all search result elements are uniform, I protect against breaking errors with the try and except clause.
    • Browserist uses Chrome by default, and you can select other browsers, for instance Firefox, with a few changes:
    from browserist import Browser, BrowserType, BrowserSettings
    
    ...
    
    with Browser(BrowserSettings(type=BrowserType.FIREFOX)) as browser:
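
As for the error in the question: in Selenium 4, a WebElement's find_element takes the locator strategy as its first argument and the selector as its second, so article.find_element("//h4[...]") sends the XPath string as the strategy, which is what the "unknown variant" message reports. The sketch below shows the two-argument form on a per-article basis. The function name extract_titles is hypothetical, and the strategy is written as the plain string "tag name" (the value of Selenium's By.TAG_NAME, and one of the variants listed in the error message) so that the sketch runs without a browser.

```python
# "tag name" is the string behind selenium's By.TAG_NAME; it is spelled out
# here only so this sketch has no third-party imports.
TAG_NAME = "tag name"


def extract_titles(articles, by=TAG_NAME, value="h4"):
    """Return the text of the first matching child of each article element.

    `articles` is assumed to be a list of Selenium WebElements (or any objects
    with a compatible find_elements(by, value) method).
    """
    titles = []
    for article in articles:
        # find_elements (plural) returns an empty list when nothing matches,
        # so non-uniform results are skipped without a try/except.
        matches = article.find_elements(by, value)
        if matches:
            titles.append(matches[0].text)
    return titles
```

With real elements, extract_titles(search_result_elements) would take the list fetched in the snippet above, and By.TAG_NAME imported from selenium.webdriver.common.by can be passed instead of the plain string.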
    

    Here's what I get, and I hope you find it useful. Let me know if you have any questions.

    Titles printed in the terminal

    Results from NY Times