
Can someone help me properly scrape YouTube titles in Python using BS4?


I want to collect YouTube video titles using BS4 in Python. This is the code GPT recommended, but it doesn't work well. I'm looking for help from an experienced coder here. Thank you :)

import requests
from bs4 import BeautifulSoup

def get_youtube_titles():
    url = 'https://www.youtube.com/'

    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
    
        # Find YouTube title elements
        title_elements = soup.find_all('a', class_='yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media')
    
        # Extract and print the titles
        for title_element in title_elements:
            title = title_element.text.strip()
            print(title)
    
    except requests.exceptions.RequestException as e:
        print('Network connection error:', e)

# Get YouTube titles

get_youtube_titles()

I asked GPT, but its code doesn't work well.


Solution

  • Your code uses requests.get, so you only get the source HTML, which is not the same as the fully rendered HTML you inspect in your browser. To get the rendered HTML, you need something that runs JavaScript (like Selenium, and don't forget to add some wait time to allow the page to load).

    However, if all you want are some titles, you can try extracting from the script tags that contain the JavaScript with the following functions:

    import json
    
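    ## a general function for extracting a JavaScript variable from a bs4 object
    def get_jsScriptVal(jSoup, valDecl, isJson=True):
        # match only <script> tags whose text contains the declaration
        script_finder = lambda s: s and valDecl in s
        for sc in jSoup.find_all('script', string=script_finder):
            # note: splitting on ';' assumes the value itself contains no ';'
            for st in sc.string.split(';'):
                # split "var x = value" into its left and right sides
                ls, rs, *_ = [s.strip() for s in (st.split('=', 1) + [''])]
                if ls == valDecl and rs:
                    return json.loads(rs) if isJson else rs
        return None  # declaration not found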
    
    
    ## specifically for your case
    def get_ytInitialTitles(ySoup):
        contents = get_jsScriptVal(ySoup, 'var ytInitialData')['contents']
        tab1 = contents['twoColumnBrowseResultsRenderer']['tabs'][0]
        contents = tab1['tabRenderer']['content']['richGridRenderer']['contents']
        contents = [c['richItemRenderer']['content']['videoRenderer'] 
                    for c in contents if 'richItemRenderer' in c and 
                    'videoRenderer' in c['richItemRenderer']['content']]
        titles = [c['title']['runs'][0]['text'] for c in contents]
        return titles
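
    As an optional aside: get_jsScriptVal splits the script on ';', so it can trip over a value that has a semicolon inside one of its strings. A minimal sketch of one way around that, assuming you pass in the raw page text (for example response.text), is to let json.JSONDecoder.raw_decode read the value and stop where the JSON ends. The name extract_js_json and the sample HTML are made up for illustration; they are not part of YouTube's page or of the functions above.

```python
import json

def extract_js_json(page_text, decl='var ytInitialData'):
    """Return the JSON value assigned to `decl` in page_text, or None."""
    start = page_text.find(decl)
    if start == -1:
        return None  # declaration not present
    eq = page_text.find('=', start + len(decl))
    if eq == -1:
        return None  # declaration without an assignment
    # raw_decode must start exactly at the JSON value, so skip whitespace
    pos = eq + 1
    while pos < len(page_text) and page_text[pos].isspace():
        pos += 1
    try:
        # parses one JSON value and ignores whatever follows it
        value, _end = json.JSONDecoder().raw_decode(page_text, pos)
    except json.JSONDecodeError:
        return None
    return value

# Hypothetical, heavily simplified page: note the ';' and '=' inside the string
sample_html = (
    '<html><body><script>var ytInitialData = '
    '{"contents": {"note": "a; b = c"}};</script></body></html>'
)
data = extract_js_json(sample_html)
print(data['contents']['note'])  # a; b = c
```

    The returned dict has the same shape as what get_jsScriptVal returns, so the rest of the answer works on it unchanged.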
    

    Now, if you edit your code to use the functions above:

    import requests
    from bs4 import BeautifulSoup
    import json
    
    #### DON'T FORGET TO PASTE THE FUNCTION DEFINITIONS INTO YOUR CODE TOO ####
    ## def get_jsScriptVal....
    ## def get_ytInitialTitles....
    ##########################################################################
    
    def get_youtube_titles():
        url = 'https://www.youtube.com/'
        try:
            response = requests.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
        
            # titles = get_ytInitialTitles(soup) # Find YouTube title elements
            # for title in titles: print(title) # Extract and print the titles
    
            # OR [in one line]
            for title in get_ytInitialTitles(soup): print(title)
        
        except Exception as e:
            print('Failed to scrape due to', type(e), ':', e)
    
    get_youtube_titles()
    

    then it should print something like

    Survive 100 Days In Circle, Win $500,000
    lofi hip hop radio 📚 - beats to relax/study to
    Spectaculair ingekleurde film over het begin van de Duitse bezetting van Nederland tijdens WOII
    Omtzigt is WOEST & SLOOPT liegende Rutte! 'Kijk die ouders in hun ogen!'
    Ineens vielen er bommen op zonnepanelen... Algemene beschouwingen Venlo 2023
    Trump Opens Up on Secret White House Documents, Biden Family & Republican Opponents | Trump LIVE
    An einem Tag nach Mallorca und zurück: Was verdient ein Flugbegleiter? | Lohnt sich das | BR
    I BUILT A SHELTER IN THE FOREST!! AND LIVED THERE FOR 2 MONTHS!
    De halvering van China
    Ibiza Summer Mix 2023 🍓 Best Of Tropical Deep House Music Chill Out Mix 2023🍓 Chillout Lounge #153
    Tibetaanse Genezende Fluit • Afgifte van melatonine en gifstoffen • Elimineer stress en kalmeer ...
    Alle 200 POTLODEN GEBRUIKEN in 1 TEKENING - Tekenen Challenge
    Top 10 BEST Auditions on BGT 2023!
    Ontspannende muziek tot opluchting stress, angst en depressie 🐬 Verzachtende muziek voor zenuwen
    6 juni 1944, D-Day, Operatie Overlord | Ingekleurd
    Ed Sheeran, Martin Garrix, Kygo, Dua Lipa, Avicii, Robin Schulz, The Chainsmokers Style - Feeling Me
    DIY with Mr Bean | Full Episodes | Classic Mr Bean
    EEN WEDSTRIJD VOL AFSCHEID! 😭🫡 | Barcelona vs Mallorca | La Liga 2022/23 | Samenvatting
    Deep Focus Music To Improve Concentration - 12 Hours of Ambient Study Music to Concentrate #506
    The Inside Guys React To The Miami Heat's Blowout Game 7 Win In Boston | NBA on TNT
    Muziek genezen om stress, vermoeidheid, depressie, negativiteit, detoxemoties te verlichten
    How Rain Caused Havoc And Changed The Race | 2023 Monaco Grand Prix
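
    One caveat: the key path in get_ytInitialTitles (twoColumnBrowseResultsRenderer, richGridRenderer, and so on) is tied to the page layout at the time of writing and may change. A rough sketch of a less layout-dependent approach is to walk the whole structure and collect every videoRenderer wherever it appears. This assumes titles still live under title → runs → [0] → text as above; iter_renderers, collect_titles and the sample data are hypothetical names and values used only for illustration.

```python
def iter_renderers(node, key='videoRenderer'):
    """Yield every dict stored under `key`, at any depth of nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, dict):
                yield v
            else:
                yield from iter_renderers(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_renderers(item, key)

def collect_titles(data):
    """Collect the first title run of each renderer, skipping malformed ones."""
    titles = []
    for renderer in iter_renderers(data):
        title = renderer.get('title')
        runs = title.get('runs') if isinstance(title, dict) else None
        if isinstance(runs, list) and runs and isinstance(runs[0], dict):
            text = runs[0].get('text')
            if text:
                titles.append(text)
    return titles

# Made-up data that only mimics the nesting; not real YouTube output
sample = {
    'contents': {
        'tabs': [
            {'items': [
                {'richItemRenderer': {'content': {'videoRenderer': {
                    'title': {'runs': [{'text': 'First video'}]}}}}},
                {'richItemRenderer': {'content': {'adSlotRenderer': {}}}},
                {'richSectionRenderer': {'shelf': [{'videoRenderer': {
                    'title': {'runs': [{'text': 'Second video'}]}}}]}},
            ]},
        ]
    }
}
print(collect_titles(sample))  # ['First video', 'Second video']
```

    With the real page you would call collect_titles on the dict returned for 'var ytInitialData' instead of on sample.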