Tags: python, python-3.x, beautifulsoup, python-requests, python-requests-html

Python scraping too slow getting YouTube titles from URLs with html render


Hi, I have Excel files with lists of YouTube URLs, and I'm trying to get their titles. There are thousands of URLs spread across 3 Excel files. I tried to do it with Python, but it turned out too slow because I had to put a sleep on the HTML render. The code looks like this:

import xlrd
import time
from bs4 import BeautifulSoup
import requests
from xlutils.copy import copy
from requests_html import HTMLSession



loc = ("testt.xls")

wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
wb2 = copy(wb)
sheet.cell_value(0, 0)

for i in range(3, sheet.nrows):
    ytlink = sheet.cell_value(i, 0)
    session = HTMLSession()
    response = session.get(ytlink)
    response.html.render(sleep=3)
    print(sheet.cell_value(i, 0))
    print(ytlink)
    element = BeautifulSoup(response.html.html, "lxml")
    media = element.select_one('#container > h1').text
    print(media)
    s2 = wb2.get_sheet(0)
    s2.write(i, 0, media)
    wb2.save("testt.xls")    

Is there any way to make this faster? I tried Selenium, but it was even slower, I think. With html.render I seem to need the sleep timer, or else it gives me an error; I tried lower sleep values, but after a while it errors out on those too. Any help please, thanks :)

PS: the prints I put in are just for checking the output; they aren't important for the actual usage.


Solution

  • You can do 1000 requests in less than a minute using async requests-html like this:

    import random
    from time import perf_counter
    from requests_html import AsyncHTMLSession
    
    urls = ['https://www.youtube.com/watch?v=z9eoubnO-pE'] * 1000
    
    asession = AsyncHTMLSession()
    start = perf_counter()
    
    async def fetch(url):
        r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
        return r
    
    all_responses = asession.run(*[lambda url=url: fetch(url) for url in urls])
    all_titles = [r.html.find('title', first=True).text for r in all_responses]
    
    print(all_titles)
    print(perf_counter() - start)
    

    Done in 55s on my laptop.

    Note that you need to pass cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))} with the request, otherwise YouTube redirects to its cookie-consent page instead of returning the video page.
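
    If you also want to write the titles back into the spreadsheet, here is a minimal sketch of how the async fetch could be combined with the xlrd/xlutils code from the question. It assumes, as in the question, that the URLs sit in column 0 starting at row 3 of testt.xls; it writes each title to column 1 of the same row (so the URLs are kept), and saves the workbook once at the end instead of on every row. The output filename is just an example. Since the order in which the responses complete is not guaranteed, each coroutine carries its row index along with it.

    import random
    import xlrd
    from xlutils.copy import copy
    from requests_html import AsyncHTMLSession

    wb = xlrd.open_workbook("testt.xls")
    sheet = wb.sheet_by_index(0)
    wb_out = copy(wb)
    out_sheet = wb_out.get_sheet(0)

    # URLs are assumed to be in column 0, starting at row 3 (as in the question)
    rows = list(range(3, sheet.nrows))
    urls = [sheet.cell_value(i, 0) for i in rows]

    asession = AsyncHTMLSession()

    async def fetch(row, url):
        # same consent cookie as above, so we get the real video page
        r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
        # return the row index together with the title, since completion order may differ
        return row, r.html.find('title', first=True).text

    results = asession.run(*[lambda row=row, url=url: fetch(row, url) for row, url in zip(rows, urls)])

    for row, title in results:
        out_sheet.write(row, 1, title)  # column 1, so the URL in column 0 stays

    wb_out.save("testt_with_titles.xls")  # hypothetical output file name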