Tags: python, python-3.x, beautifulsoup, python-requests, python-requests-html

Python scraping too slow getting YouTube titles from URLs with html render


Hi, I have Excel files with lists of YouTube URLs, and I'm trying to get their titles. There are thousands of URLs spread across 3 Excel files. I tried to do it with Python, but it turned out too slow because I had to put a sleep on the HTML render. The code looks like this:

import xlrd
import time
from bs4 import BeautifulSoup
import requests
from xlutils.copy import copy
from requests_html import HTMLSession



loc = ("testt.xls")

wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
wb2 = copy(wb)
sheet.cell_value(0, 0)

for i in range(3, sheet.nrows):
    ytlink = sheet.cell_value(i, 0)
    session = HTMLSession()
    response = session.get(ytlink)
    response.html.render(sleep=3)
    print(sheet.cell_value(i, 0))
    print(ytlink)
    element = BeautifulSoup(response.html.html, "lxml")
    media = element.select_one('#container > h1').text
    print(media)
    s2 = wb2.get_sheet(0)
    s2.write(i, 0, media)
    wb2.save("testt.xls")    

Is there any way to make this faster? I tried Selenium, but it was even slower, I think. With html.render I seem to need the sleep timer, or else it gives me an error; I tried lower sleep values, but after a while it errors out on those too. Any help please, thanks :)

PS: the prints I put in are just for checking the output; they aren't important for the actual usage.


Solution

  • You can do 1000 requests in less than a minute using async requests-html like this:

    import random
    from time import perf_counter
    from requests_html import AsyncHTMLSession
    
    urls = ['https://www.youtube.com/watch?v=z9eoubnO-pE'] * 1000
    
    asession = AsyncHTMLSession()
    start = perf_counter()
    
    async def fetch(url):
        r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
        return r
    
    all_responses = asession.run(*[lambda url=url: fetch(url) for url in urls])
    all_titles = [r.html.find('title', first=True).text for r in all_responses]
    
    print(all_titles)
    print(perf_counter() - start)
    

    Done in 55s on my laptop.

    Note that you need to pass cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))} with the request, otherwise YouTube redirects to its cookie-consent page instead of returning the video page.
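
    If you also want to write the titles back into the spreadsheet, here is a minimal sketch of how the async fetch could be combined with the xlrd/xlutils code from the question. It assumes, as in the question, that the URLs sit in column 0 starting at row 3 of testt.xls; it writes each title to column 1 of the same row (so the URLs are kept), and saves the workbook once at the end instead of on every row. The output filename is just an example. Since the order in which the responses complete is not guaranteed, each coroutine carries its row index along with it.

    import random
    import xlrd
    from xlutils.copy import copy
    from requests_html import AsyncHTMLSession

    wb = xlrd.open_workbook("testt.xls")
    sheet = wb.sheet_by_index(0)
    wb_out = copy(wb)
    out_sheet = wb_out.get_sheet(0)

    # URLs are assumed to be in column 0, starting at row 3 (as in the question)
    rows = list(range(3, sheet.nrows))
    urls = [sheet.cell_value(i, 0) for i in rows]

    asession = AsyncHTMLSession()

    async def fetch(row, url):
        # same consent cookie as above, so we get the real video page
        r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
        # return the row index together with the title, since completion order may differ
        return row, r.html.find('title', first=True).text

    results = asession.run(*[lambda row=row, url=url: fetch(row, url) for row, url in zip(rows, urls)])

    for row, title in results:
        out_sheet.write(row, 1, title)  # column 1, so the URL in column 0 stays

    wb_out.save("testt_with_titles.xls")  # hypothetical output file name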