Tags: python, youtube, youtube-data-api, python-multiprocessing, multiprocess

How to use Python Multiprocessing with YouTube API for crawling


I'm still a novice with Python, and using multiprocessing is a big step for me.

So my question is: how do I speed up crawling the comment sections of YouTube videos through the YouTube API by using multiprocessing?

This project needs to crawl the comments of a few hundred thousand videos in a limited time. I understand that multiprocessing is used with normal scraping methods such as BeautifulSoup/Scrapy, but what about when I use the YouTube API?

If I use the YouTube API (which requires API keys) to crawl the data, will multiprocessing be able to do the job using multiple keys, or will it use the same one over and over for different tasks?

To simplify: is it possible to use multiprocessing with code that relies on API keys, rather than with normal scraping methods that don't require them?

Does anyone have any ideas?


Solution

  • This won't directly answer your question, but I suggest having a look at the YouTube API quota:

    https://developers.google.com/youtube/v3/getting-started#calculating-quota-usage

    By default, your project will have a quota of just 10,000 units per day, and retrieving comments costs between 1 and 5 units per comment (if you also want the video data they're attached to, add another 21 units per video). At the worst case of 5 units per comment, that works out to only about 2,000 comments per day via the API unless you put in a quota increase request, which can take weeks.

    Edit: Google will generate starter code for you in the language of your choice for a given request. I'd recommend filling in the form here with your request and using the result as a starting point: https://developers.google.com/youtube/v3/docs/comments/list (click "Populate APIs Explorer" -> "See Code Samples" -> enter more info on the left). Sketches of that starting point, and of how to spread the work across processes, follow below.
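
For reference, here is a minimal sketch of the kind of starting point that form generates, using the google-api-python-client library (`pip install google-api-python-client`). Note that it calls `commentThreads.list` (which accepts a `videoId`) rather than the `comments.list` endpoint linked above (which fetches replies by parent comment ID). The API key and video ID are placeholders, and pagination via `nextPageToken` is omitted:

```python
from googleapiclient.discovery import build

# Placeholders: substitute your own API key and a real video ID.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

request = youtube.commentThreads().list(
    part="snippet",
    videoId="VIDEO_ID",
    maxResults=100,         # up to 100 threads per request
    textFormat="plainText",
)
response = request.execute()

# Each item is a comment thread; the top-level comment holds the text.
for item in response["items"]:
    snippet = item["snippet"]["topLevelComment"]["snippet"]
    print(snippet["authorDisplayName"], ":", snippet["textDisplay"])
```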
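
As for the multiprocessing part of the question: yes, an API key is just a string passed to the client constructor, so each worker process can use a different key. Keep in mind that quota is counted per Google Cloud project, so multiple keys only help if each comes from a separate project. Below is a minimal sketch under those assumptions; the keys and video IDs are placeholders, and pagination and error handling are again omitted:

```python
import multiprocessing as mp

from googleapiclient.discovery import build

# Hypothetical keys -- each should belong to its own Google Cloud project,
# because quota is counted per project, not per key string.
API_KEYS = ["KEY_FROM_PROJECT_1", "KEY_FROM_PROJECT_2", "KEY_FROM_PROJECT_3"]


def fetch_comments(job):
    """Fetch top-level comments for a batch of videos with one API key."""
    api_key, video_ids = job
    # Build the client inside the worker: the underlying HTTP connection
    # shouldn't be shared across processes.
    youtube = build("youtube", "v3", developerKey=api_key)
    results = {}
    for video_id in video_ids:
        response = youtube.commentThreads().list(
            part="snippet", videoId=video_id, maxResults=100
        ).execute()
        results[video_id] = [
            item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
            for item in response["items"]
        ]
    return results


if __name__ == "__main__":
    video_ids = ["VIDEO_ID_1", "VIDEO_ID_2"]  # your list of video IDs
    # Deal the videos out round-robin so each key gets an equal share.
    batches = [video_ids[i::len(API_KEYS)] for i in range(len(API_KEYS))]
    with mp.Pool(processes=len(API_KEYS)) as pool:
        for partial in pool.map(fetch_comments, zip(API_KEYS, batches)):
            print(partial)
```

Building the client inside the worker function, rather than in the parent, is deliberate: each process gets its own connection, and the key it receives determines which project's quota its requests are billed against.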