Extracting values from Beautiful Soup


I'm quite new to programming and I'm working on a voice assistant in Python. I found this code on GitHub, but it doesn't work as it should. Here is the code:

# Imports assumed; the original snippet did not show them
from urllib.parse import quote

from bs4 import BeautifulSoup
from requests import get


def Play(speech):
    if speech.endswith("on YouTube"):
        searchTerm = speech.split()
        response = get("https://www.youtube.com/results?search_query=" + quote(" ".join(searchTerm[:-2])))
        soup = BeautifulSoup(response.text, "html.parser")
        videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
        # Was [:3], changed to [1:4] to try to stop ads
        # Try to remove Google ads if possible (may have fixed, but test this)
        names = list()
        links = list()
        for i in range(len(videos)):
            names.insert(i, videos[i]["title"])
            links.insert(i, "https://www.youtube.com" + videos[i]["href"])
        print("I found 3 videos. " + ". ".join(names), links)

The URL passed to get() works correctly, and so does the soup variable, but "videos" is empty, so no videos are printed at the end, and I don't know how to fix this.

Any ideas, please? :)


Solution

  • You can't get the contents of a dynamic website like YouTube using requests alone. Sorry to be so direct, but this is the truth. requests only downloads the initial HTML and does not run the JavaScript that builds the page.

    You first need to open the URL, then render the response using something like Chromium in the background, then pass the result to Beautiful Soup.

    The rendering will take 1-2 seconds. This is how it's done.

    Here is a snippet that extracts the dynamic website's contents, which are then passed to BeautifulSoup:

    # pip install playwright
    # After installing the package you will be prompted to install
    # Chromium, the "thing" I was talking about
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright
    
    
    def get_dynamic_soup(url: str) -> BeautifulSoup:
        with sync_playwright() as p:
            # Launch the browser (headless by default)
            browser = p.chromium.launch()
    
            # Open a new browser page
            page = browser.new_page()
    
            # Navigate to the URL and let the page load
            page.goto(url)
    
            # Parse the rendered HTML with BeautifulSoup
            soup = BeautifulSoup(page.content(), "html.parser")
    
            browser.close()
    
            return soup
    
    # quote and searchTerm are defined in your code
    _url = "https://www.youtube.com/results?search_query=" + quote(" ".join(searchTerm[:-2]))
    soup = get_dynamic_soup(_url)
    # Now you can do whatever you want with the soup
    

    Then you can do your stuff:

    videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
    

    To install Playwright:

    python -m pip install playwright # this installs the Python package
    python -m playwright install # this installs the browser executables, including Chromium
    

    See the Playwright docs for installation details.

    EDIT: One more thing about this line in your code:

    videos = soup.findAll(attrs={"class":"yt-uix-tile-link"})[1:4]
    

    Passing only attrs is valid, but it matches any tag with that class. It is safer to also specify the HTML element you want to search for.

    A good example is:

    # your loop reads ["href"], so a link element is the likely target
    videos = soup.findAll("a", attrs={
        "class": "yt-uix-tile-link"
    })[1:4]
    # or
    videos = soup.findAll("span", attrs={
        "class": "yt-uix-tile-link"
    })[1:4]
    # or whatever element it is
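
The lookup and extraction step can also be tried offline, without YouTube or a browser. The sketch below is only an illustration under stated assumptions: `SAMPLE_HTML` is hand-written markup (not YouTube's real HTML), and the helper name `extract_videos` is hypothetical. It shows that `attrs` alone matches any tag carrying the class, that naming the tag narrows the match, and that a missing class simply gives an empty list, which is the symptom described in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration, not real YouTube HTML).
SAMPLE_HTML = """
<div class="yt-uix-tile-link">not a link</div>
<a class="yt-uix-tile-link" title="First video" href="/watch?v=aaa">First</a>
<a class="yt-uix-tile-link" title="Second video" href="/watch?v=bbb">Second</a>
<a class="other" title="Unrelated" href="/about">About</a>
"""


def extract_videos(soup, base_url="https://www.youtube.com", limit=3):
    """Return (title, absolute_url) pairs for <a> tags with the target class."""
    results = []
    for tag in soup.find_all("a", attrs={"class": "yt-uix-tile-link"}):
        title = tag.get("title")
        href = tag.get("href")
        if title and href:  # skip tags missing either attribute
            results.append((title, urljoin(base_url, href)))
    return results[:limit]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# attrs alone matches every tag carrying the class, including the <div>.
any_tag = soup.find_all(attrs={"class": "yt-uix-tile-link"})
# Naming the tag restricts the match to <a> elements.
only_links = soup.find_all("a", attrs={"class": "yt-uix-tile-link"})

videos = extract_videos(soup)
# If the class is absent from the markup, the result is an empty list.
empty = extract_videos(BeautifulSoup("<p>no results here</p>", "html.parser"))

print(len(any_tag), len(only_links))
print(videos)
print(empty)
```

Using `tag.get(...)` instead of `tag[...]` avoids a `KeyError` when a matched tag lacks `title` or `href`, and checking for an empty list before printing makes it obvious when the fetched markup does not contain the class at all.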