python selenium youtube youtube-dl

What is the fastest / most lightweight way of getting HTML after JavaScript has executed?


The problem is that the YouTube API for searching is very limiting, so I've resorted to web scraping the search results page. So far I've tried using Selenium to load the page and get the HTML, but it has quite a bit of delay when starting up.

Without JavaScript, the YouTube search results page is not generated properly, so I can't just run a GET request on the URL.

Are there any other ways to get the rendered search results page?

My code right now:

    def search(self, query):
        try:
            self.driver.get('https://www.youtube.com/results?search_query={}'.format(str(query)))
            # Wait until at least one result title is visible
            self.wait.until(self.visible((By.ID, "video-title")))
            elements = self.driver.find_elements(By.XPATH, "//*[@id=\"video-title\"]")
            results = []
            for element in elements:
                results.append([element.text, element.get_attribute('href')])
            return results
        except Exception:
            # Any failure (timeout, missing element, ...) yields no results
            return []

This is part of a class that reuses the same Selenium instance until the program shuts down.

SOLUTION

    import requests

    def search(self, query):
        # Fetch the raw HTML; requests URL-encodes the query (spaces become '+')
        response = requests.get('https://www.youtube.com/results',
                                params={'search_query': str(query)})
        html = response.text
        index = 1
        j = 0
        result = []
        while j <= 40:  # results are located at every 4th "videoId" tag
            newindex = html.find('"videoId":"', index)
            if newindex == -1:
                break  # no more "videoId" tags in the page
            videonameindex = html.find('{"text"', newindex)
            index = newindex + 1
            if j % 4 == 0:
                videoname = html[videonameindex + 8:videonameindex + 100]
                name = videoname.split('}],')[0].replace('"', '')
                videoId = html[newindex:newindex + 30].split(':')[1].split(',')[0].replace('"', '')
                # make sure the video ID is valid
                if len(videoId) != 11:
                    continue
                url = f'https://www.youtube.com/watch?v={videoId}'
                result.append([name, url])
            j += 1
        # self.conn belongs to the surrounding class (not shown here)
        self.conn.commit()
        return result

The code is a bit longer, but now there is no long wait for Selenium to start up, and no need to wait for JavaScript to finish executing.

Thanks to @Benjamin Loison


Solution

  • If you curl https://www.youtube.com/results?search_query=test, you will see that the results data you are looking for is part of the JavaScript variable ytInitialData. I would recommend just fetching this HTML file and parsing its ytInitialData variable. That way you don't need a browser automation tool such as Selenium, which is particularly slow and isn't required here.

    Note: I am developing an open-source alternative to the YouTube Data API v3 using this method. I have an endpoint similar to what you are looking for by the way.
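
As a rough illustration of parsing ytInitialData, here is a minimal sketch rather than a definitive implementation. It assumes the page assigns the variable as `ytInitialData = {...}` and that each result is stored under a `videoRenderer` key with `videoId` and `title.runs[].text` fields; those names come from inspecting the page and may change without notice. The helper names (`extract_initial_data`, `find_videos`, `search_from_html`) are hypothetical.

```python
import json

# Assumption: the page assigns the variable like `var ytInitialData = {...};`
MARKER = "ytInitialData = "


def extract_initial_data(html):
    """Return the ytInitialData object embedded in the page, or None."""
    start = html.find(MARKER)
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        data, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    except json.JSONDecodeError:
        return None
    return data


def find_videos(node, results=None):
    """Recursively collect [title, url] pairs from "videoRenderer" objects."""
    if results is None:
        results = []
    if isinstance(node, dict):
        # Assumption: search results are dicts stored under "videoRenderer"
        renderer = node.get("videoRenderer")
        if isinstance(renderer, dict) and "videoId" in renderer:
            runs = renderer.get("title", {}).get("runs", [])
            title = "".join(run.get("text", "") for run in runs)
            url = "https://www.youtube.com/watch?v=" + renderer["videoId"]
            results.append([title, url])
        for value in node.values():
            find_videos(value, results)
    elif isinstance(node, list):
        for item in node:
            find_videos(item, results)
    return results


def search_from_html(html):
    """Return [title, url] pairs found in a search results page's HTML."""
    data = extract_initial_data(html)
    return find_videos(data) if data is not None else []
```

The HTML would come from a plain GET request, for example `requests.get('https://www.youtube.com/results', params={'search_query': query}).text`, and is then passed to `search_from_html`. Walking the whole structure instead of hard-coding a path keeps the sketch less sensitive to layout changes, at the cost of visiting every node.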