Tags: python, web-scraping, beautifulsoup, python-requests, request

How to expand content in Beautiful Soup using a button


I am trying to scrape this site: link. The problem is that the page URL stays the same even when I click the button that expands the content. I need to scrape all of the news going back to the first post.

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.internazionale.it/tag/la-settimana"

    html = requests.get(url)
    html.raise_for_status()

    s = BeautifulSoup(html.text, 'html.parser')

    results = s.find('div', class_='hentryfeed__container container_full')
    link_articolo = results.find_all('div', class_='box-article-intro')

    for articolo in link_articolo:
        link_articoli = articolo.find('a', href=True)
        print('https://www.internazionale.it' + link_articoli['href'])

This code works for the first page, but the button that expands the content doesn't change the URL, so I need another way to scrape all the news back to the first post.
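
For context, the button-driven expansion can also be reproduced by clicking the button in a real browser. A minimal Selenium sketch, where the selector `a.load-more` is only a hypothetical placeholder for the actual "load more" button in the page markup:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Chrome()
    driver.get("https://www.internazionale.it/tag/la-settimana")

    while True:
        try:
            # "a.load-more" is a hypothetical selector; inspect the page
            # to find the real one for the expand button
            button = driver.find_element(By.CSS_SELECTOR, "a.load-more")
        except NoSuchElementException:
            break
        button.click()
        time.sleep(1)  # give the Ajax call time to append new articles

    # the fully expanded page can now be handed to BeautifulSoup
    html = driver.page_source
    driver.quit()

Driving a real browser is slower than calling the Ajax endpoint directly, which is what the solution below does.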


Solution

  • To get all the links, you can use this example (emulating the Ajax call using requests):

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.internazionale.it/tag/la-settimana"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # the stream id identifies this tag's feed in the site's Ajax API
    stream_id = soup.select_one("[data-stream-id]")["data-stream-id"]

    # load the links already present on the first page
    links = []
    for article in soup.select(".box-article__data"):
        links.append("https://www.internazionale.it" + article.a["href"])
        # after the loop this holds the date of the oldest article on the
        # page, which serves as the pagination cursor for the Ajax calls
        data_datetime = article.find_previous(attrs={"data-datetime": True})[
            "data-datetime"
        ].split()[0]

    # load the rest of the links by emulating the "load more" Ajax call
    while True:
        url = f"https://data.internazionale.it/stream_data/items/tag/0/{stream_id}/{data_datetime}.json"
        data = requests.get(url).json()

        # the endpoint returns no items once the first post has been reached
        if not data.get("items"):
            break

        for i in data["items"]:
            links.append("https://www.internazionale.it" + i["url"])
            print(links[-1])

        # advance the cursor to this batch's datetime
        data_datetime = data["datetime"].split()[0]

    # `links` now contains all the links
    

    Prints:

    ...
    
    https://www.internazionale.it/opinione/giovanni-de-mauro/2001/08/02/la-battaglia-di-genova
    https://www.internazionale.it/opinione/giovanni-de-mauro/2001/01/13/astroturf
    https://www.internazionale.it/opinione/giovanni-de-mauro/1999/03/11/tutti-al-centro
    https://www.internazionale.it/opinione/giovanni-de-mauro/1998/05/07/il-futuro-di-israele
    https://www.internazionale.it/opinione/giovanni-de-mauro/1998/04/29/i-nuovi-vicini
    https://www.internazionale.it/opinione/giovanni-de-mauro/1995/12/22/interviste
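
    As a usage note, the collected `links` list can then be persisted for later scraping of the individual articles. A minimal sketch, assuming the `links` variable from the code above (the filename is arbitrary):

        import csv

        # assumes `links` was built by the loop above; one URL per row
        with open("la_settimana_links.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url"])
            writer.writerows([link] for link in links)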