
Python 3 and Requests-HTML: scraping a website does not return the "real" HTML


I'm trying to scrape a website, but the HTML I get back is not the complete, parseable page.

I am using Python 3.12 and the Requests-HTML module to scrape several websites. For some of them it works without problems, but for "https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html" it does not, even though I call Requests-HTML's render function to execute the JavaScript on the page. From analyzing the website, I know that the information I am looking for is contained in td tags with the attribute data-label="Künstler" (artist). But the HTML I get back after fetching and rendering does not contain a single such tag.

I don't know what to do. Can someone point me in the right direction?

from requests_html import HTMLSession


charts = {'ODC50': {
            'name': 'ODC50',
            'anz': 50,
            'url': 'https://www.mix1.de/charts/dance50.htm',
            'entry': 'div.charts-main-block',
            'date': '#mix1_content div.mybox_content'
        },
        'DDPHot50': {
            'name': 'DDP Hot50',
            'anz': 50,
            'url': 'https://www.deutsche-dj-playlist.de/hot-50/dance',
            'entry': 'div.list div.entry',
            'date': 'div.header div.title'
        },
        'Ostseewelle': {
            'name': 'Ostseewelle',
            'anz': 20,
            'url': 'https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html',
            'entry': 'section',
            'date': 'h3.text-center.titel1'
        }
}

choice = 'Ostseewelle'


chart_site = charts.get(choice).get('url')
session = HTMLSession()
r = session.get(chart_site)
r.html.render(sleep=2, keep_page=True, scrolldown=5, timeout=30)

print(r.status_code)

html = r.html

#print(html.html)

tds = html.xpath('//td[@data-label="Künstler"]')
print(f'Entries found: {len(tds)}')


print('Program finished')

The HTML I get back is incomplete: the td elements I expect to parse are missing.


Solution

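  • The chart data you see on the page is loaded from an external URL, so it is not part of the HTML returned for the Ostseewelle page itself. To get the artist information, you can request that URL directly, as in the following example: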

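    import requests
    from bs4 import BeautifulSoup

    # The chart table is served from this external URL, not from ostseewelle.de.
    url = "https://enricoostendorf.de/top20/top20eo.php"

    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # Each matching cell contains two text nodes: the artist, then the title.
    for k in soup.select('[data-label="Künstler"]'):
        artist, title = k.get_text(strip=True, separator="|||").split("|||")
        print(artist)
        print(title)
        print("-" * 80)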
    

    Prints:

    ...
    
    --------------------------------------------------------------------------------
    Loi
    "Am I Enough"
    --------------------------------------------------------------------------------
    Nico Santos & Fast Boy
    "Where You Are"
    --------------------------------------------------------------------------------
    Ofenbach
    "Overdrive" (feat. Norma Jean Martine)
    --------------------------------------------------------------------------------
    Robin Schulz, Rita Ora, Tiago PZK
    "I'll Be There"
    --------------------------------------------------------------------------------
    Tate McRae
    "greedy"
    --------------------------------------------------------------------------------
    Dua Lipa
    "Houdini"
    --------------------------------------------------------------------------------
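
The answer above does not say how the external URL was found. One possible approach, sketched here under assumptions, is to list every iframe, script, or embed tag whose src points to a different host than the page itself. The class ExternalSourceFinder, the function find_external_sources, and the sample markup are all hypothetical; in particular, the sample is not the real Ostseewelle page source, and the real page may embed the chart in some other way. The sketch uses only the standard library so that it runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ExternalSourceFinder(HTMLParser):
    """Collect src URLs of iframe/script/embed tags that point to another host."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("iframe", "script", "embed"):
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative URLs against the page URL before comparing hosts.
        absolute = urljoin(self.page_url, src)
        if urlparse(absolute).netloc != self.page_host:
            self.external.append((tag, absolute))


def find_external_sources(html_text, page_url):
    """Return (tag, url) pairs for resources loaded from a different host."""
    finder = ExternalSourceFinder(page_url)
    finder.feed(html_text)
    return finder.external


# Hypothetical sample markup for illustration, NOT the real page source.
sample_html = """
<html><body>
  <script src="/js/site.js"></script>
  <section>
    <iframe src="https://enricoostendorf.de/top20/top20eo.php"></iframe>
  </section>
</body></html>
"""

page_url = "https://www.ostseewelle.de/sendungen/H%C3%B6rercharts-id379456.html"
sources = find_external_sources(sample_html, page_url)
for tag, src in sources:
    print(tag, src)
```

To try this against a live page, pass the text of the fetched response (for example r.text from requests) as html_text. If nothing shows up, the data may instead be fetched by JavaScript at runtime, in which case the browser's network tab is the more reliable place to look.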