Search code examples
web-scrapingvideoyoutubegoogle-colaboratory

Yoututbe scraping by colab


I need to scrape car type video from YouTube by some tags like this list in Google Colab :

Abarth
AC
Acura
Adam
Adler
AEC
Aero
Aixam
Albion

SO i have tried these two code to find the video tag ( for example tag='Peugeot') in google colab:

!pip install youtube-search-python
from youtubesearchpython import SearchVideos

search = SearchVideos("NoCopyrightSounds", offset = 1, mode = "json", max_results = 20)

print(search.result())

and


!pip install youtube-dl
!echo '' > ford_video_list.txt
!chmod 755 ford_video_list.txt

!youtube-dl --match-title 'ford' --add-metadata --write-thumbnail --list-thumbnails    --mark-watched --write-info-json 'ford_video_description_json.txt' --write-description 'ford_video_description.txt'  --cookies='Search-youtube-url-file.txt' --ignore-errors  --skip-download --get-url -f bestvideo+bestaudio/best --default-search "ytsearch2000:" "Ford Festiva" >> ford_video_list.txt 

!echo '*****End of test 1 ******'

But by trying this code it don't showing any result:

import urllib.request
from bs4 import BeautifulSoup

textToSearch = 'python tutorials'
query = urllib.parse.quote(textToSearch)
url = "https://www.youtube.com/results?search_query=" + query
response = urllib.request.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
for vid in soup.findAll(attrs={'class':'yt-uix-tile-link'}):
    if not vid['href'].startswith("https://googleads.g.doubleclick.net/"):
        print('https://www.youtube.com' + vid['href'])

So, i guess the class name is not correct!, and i asked here for debugging it.

Update: I have made one google colab page (shown below) to test those codes ( also the code of youtube-dl showing this error:

https://colab.research.google.com/drive/1bZQ68gLggTQHCG_5fQQJJTICHA4K3HJ3?usp=sharing

ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

I understand that error made because:

The google don't like too many request form one IP Address.

So tried to add these tags(--rm-cache-dir --force-ipv4 --verbose) to youtube-dl command as you can see below ( based of these reffrences 1 2 3):

enter image description here


ERROR: Unable to download webpage: HTTP Error 429: Too Many Requests (caused by <HTTPError 429: 'Too Many Requests'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "/usr/local/lib/python3.6/dist-packages/youtube_dl/extractor/common.py", line 632, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "/usr/local/lib/python3.6/dist-packages/youtube_dl/YoutubeDL.py", line 2238, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 564, in error
    result = self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)

enter image description here

Thanks.


Solution

  • It has been working by changing the '"ytsearch2000:" "Ford Festiva"` :

    to ' "ytsearch50":"Ford Festiva" as you can see below:

    !pip install youtube-dl 
    # !youtube-dl --default-search gvsearch5:how to develop for android --no-playlist --write-info-json --write-annotation --write-thumbnail --write-sub --skip-download 
    !youtube-dl --match-title 'ford' "ytsearch50":"Ford Festiva"+"peugeot 405"  --write-info-json --write-annotation --write-thumbnail --write-sub -f 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4'
    

    and the problem was because of : and " location mistaking!

    the entire code for scraping the video of some car type video from google could be seen here: https://colab.research.google.com/github/CAR-Driving/yoloOnGoogleColab/blob/master/database_creating/Yoututbe_scraping_by_colab.ipynb