Tags: python, beautifulsoup, python-requests

BeautifulSoup doesn't return the proper HTML


I've been trying to scrape this website for a few days: https://www.spiegel.de/suche/?suchbegriff=letzte%2Bgeneration&erschienenBei=der-spiegel

I've been trying to scrape the site using requests and BeautifulSoup. My final goal is to get all links that include the keywords "Letzte Generation" or "Klimaaktivisten". For now I've been using the following code to get the HTML:

import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

os.chdir("Path is here")  # placeholder for the working-directory path

spiegel_lg_suche = "https://www.spiegel.de/suche/?suchbegriff=letzte%2Bgeneration&seite={}&erschienenBei=der-spiegel"

# Create the empty list "linkliste_spiegel_suche"
linkliste_spiegel_suche = []

# Loop over the page numbers 1-10
for seitenzahl in range(1, 11):
    # Insert the number into the base URL format
    url = spiegel_lg_suche.format(seitenzahl)
    # Load the page content into BeautifulSoup
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'html.parser')

    (...)

After this excerpt there is code that iterates over different HTML tags (which used to work when scraping the "Letzte Generation" tag page) and saves all values as a DataFrame as well as a CSV.

While the code worked when scraping the "Letzte Generation" tag page, it doesn't work for the search page. My instructor looked over the code and showed me that BeautifulSoup receives the page without the search query applied. However, he was unable to help me further. I still want to solve the problem, just for the sake of it.

Could using Selenium help with the problem?


Solution

  • The content of the website you're trying to access is loaded via AJAX. Try replacing the URL with:

    url = "https://www.spiegel.de/services/sitesearch/search?segments=spon&q=letzte+generation&page_size=20&page={}"
    

    You won't need bs4 to parse the results, as they are JSON.
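    For example, here is a minimal sketch of that JSON handling. The payload below is a trimmed illustration; only the `num_results` and `results` field names are taken from the real response, and the inner `title` values are placeholders:

```python
import json

# Trimmed, illustrative payload shaped like the search API's response;
# the "title" values are placeholders, not real data.
sample = '{"num_results": 2, "results": [{"title": "A"}, {"title": "B"}]}'

content = json.loads(sample)   # parse the JSON string into a dict
print(content["num_results"])  # 2
for result in content["results"]:
    print(result["title"])
```

    Note that with requests, `response.json()` does the `json.loads` step for you.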

    On a separate note, Spiegel.de could also detect that you're using a script from the User-Agent header and return a captcha. Try adding:

    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"})
    

    Here is the full code:

    import requests
    import json

    spiegel_lg_suche = "https://www.spiegel.de/services/sitesearch/search?segments=spon&q=letzte+generation&page_size=20&page={}"

    # Create the empty list "linkliste_spiegel_suche"
    linkliste_spiegel_suche = []

    # Loop over the page numbers 1-10
    for seitenzahl in range(1, 11):
        # Insert the number into the base URL format
        url = spiegel_lg_suche.format(seitenzahl)
        # Fetch the JSON response (no BeautifulSoup needed)
        page = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"}).text
        content = json.loads(page)
        num_results = content["num_results"]
        print("page", seitenzahl, "returned", num_results, "results")
        for result in content["results"]:
            print("-------------------------------------------")
            print(result)
            print("-------------------------------------------")
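    Building on that, the original goal (collecting links whose articles mention "Letzte Generation" or "Klimaaktivisten") could be sketched roughly as below. The `title` and `url` keys inside each result are assumptions about the API's response shape, not verified against it, and the sample data is purely illustrative:

```python
# Hedged sketch: filter results by keyword and collect the links.
# The "title" and "url" keys are assumed field names for each result.
keywords = ("letzte generation", "klimaaktivisten")

def matching_links(results):
    """Return the URLs of results whose title mentions one of the keywords."""
    links = []
    for result in results:
        title = result.get("title", "").lower()
        if any(keyword in title for keyword in keywords):
            links.append(result.get("url"))
    return links

# Illustrative data shaped like the assumed API response
sample_results = [
    {"title": "Letzte Generation blockiert Autobahn", "url": "https://www.spiegel.de/a"},
    {"title": "Unrelated article", "url": "https://www.spiegel.de/b"},
]
print(matching_links(sample_results))  # ['https://www.spiegel.de/a']
```

    The collected links could then be appended to `linkliste_spiegel_suche` inside the loop and saved as a DataFrame/CSV, as in the original script.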