Tags: python, pagination, youtube, youtube-data-api

How to retrieve large amounts of data (5000+ videos) from YouTube Data API v3?


My goal is to extract all videos from a playlist that can contain many videos: typically around 3000, and sometimes more than 5000. With maxResults=50 and pagination implemented via nextPageToken, I'm only able to call the API 20 times, after which nextPageToken is no longer included in the response.

I'm calling the API from a Python application. A while loop runs until nextPageToken is no longer sent. Ideally this should happen only AFTER all the videos have been extracted, but the loop exits prematurely after 19-20 API calls.

def main():
    youtube = get_authorised_youtube()  # returns YouTube resource authorized with OAuth.

    first_response = make_single_request(youtube, None)  # make_single_request() takes in the youtube resource and nextPageToken, if any.
    nextPageToken = first_response["nextPageToken"]

    try:
        count = 0
        while True:
            response = make_single_request(youtube, nextPageToken)
            nextPageToken = response["nextPageToken"]
            
            
            count += 1
            print(count, end=" ")
            print(nextPageToken)
    except KeyError as e:  # KeyError to catch if nextPageToken wasn't present
        response.pop("items")
        print(response)  # prints the last response for analysis


if __name__ == '__main__':
    main()

Snippet of make_single_request():

def make_single_request(youtube, nextPageToken):
    # Same parameters for every page; pageToken is only added when we have one.
    params = {"part": "id", "myRating": "like", "maxResults": 50}
    if nextPageToken is not None:
        params["pageToken"] = nextPageToken
    response = youtube.videos().list(**params).execute()

    return response

I expected the code to make upwards of 50 API calls, but it consistently makes only around 20.

Note: The code was executed with an unpaid GCP account. The calls use part="id", which has a quota cost of 0. The quota limit according to GCP is 10,000. According to the quota page in the console, I make only 20 calls.
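For comparison, the loop described above can be written without relying on a KeyError to detect the last page. This is only a minimal sketch: iter_pages and fetch_page are hypothetical names, and FAKE_PAGES is a stand-in for the API so the example runs offline. It does not lift any server-side limit; it only makes the stop condition explicit.

```python
def iter_pages(fetch_page):
    """Yield each response page until the API stops sending nextPageToken.

    fetch_page is a hypothetical callable that takes a page token (None for
    the first page) and returns the decoded response dict.
    """
    token = None
    while True:
        response = fetch_page(token)
        yield response
        token = response.get("nextPageToken")  # None when the key is absent
        if token is None:
            break


# Stand-in for the real API: three pages, the last one without a token.
FAKE_PAGES = {
    None: {"items": [1, 2], "nextPageToken": "A"},
    "A": {"items": [3, 4], "nextPageToken": "B"},
    "B": {"items": [5]},
}

pages = list(iter_pages(FAKE_PAGES.__getitem__))
items = [item for page in pages for item in page["items"]]
print(len(pages), items)
```

With the real client, the same generator could be driven by something like iter_pages(lambda token: make_single_request(youtube, token)).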

Output:

1 CGQQAA
2 CJYBEAA
3 CMgBEAA
4 CPoBEAA
5 CKwCEAA
6 CN4CEAA
7 CJADEAA
8 CMIDEAA
9 CPQDEAA
10 CKYEEAA
11 CNgEEAA
12 CIoFEAA
13 CLwFEAA
14 CO4FEAA
15 CKAGEAA
16 CNIGEAA
17 CIQHEAA
18 CLYHEAA
19 {'kind': 'youtube#videoListResponse', 'etag': '"ETAG"', 'prevPageToken': 'CLYHEAE', 'pageInfo': {'totalResults': TOTAL_RESULTS(>4000), 'resultsPerPage': 50}}

EDIT: After changing to maxResults=20, the code makes around 50 API calls, so the total number of videos that can be extracted appears to be capped at a constant 1000 (50 x 20 = 20 x 50 = 1000).
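The page tokens in the output above seem to tell the same story. The sketch below rests on an assumption, not on anything documented: that each token is URL-safe base64 wrapping a small protobuf-style message whose first field is the item offset. That reading happens to fit the tokens printed above, but the format is undocumented and could change at any time. The helper names read_varint and token_offset are made up for this example.

```python
import base64


def read_varint(data, pos):
    """Decode one protobuf-style varint at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return value, pos
        shift += 7


def token_offset(page_token):
    """Return the offset assumed to be stored in field 1 of the token."""
    padded = page_token + "=" * (-len(page_token) % 4)  # restore stripped padding
    raw = base64.urlsafe_b64decode(padded)
    tag, pos = read_varint(raw, 0)
    if tag != 0x08:  # 0x08 = field number 1, wire type 0 (varint)
        raise ValueError("unexpected token layout")
    offset, _ = read_varint(raw, pos)
    return offset


for token in ("CGQQAA", "CJYBEAA", "CLYHEAA", "CLYHEAE"):
    print(token, token_offset(token))
```

Under that assumption, the tokens step by 50, and the final prevPageToken (CLYHEAE) decodes to 950, which would put the last page at items 950-999, consistent with a 1000-item ceiling.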

EDIT #2: Thanks to @Platinum for linking a working solution to this thread. The workaround does not use the Data API, but instead uses Google's myactivity page. Link to thread


Solution

  • If the goal is to retrieve the FULL list of liked videos in a tedious but working way, you can check out this question.

    You basically scrape the data from a deep-link page...

    What is not mentioned in that post is that once you have retrieved the video IDs, you can get more data by calling the videos endpoint with a comma-separated list of video IDs.

    If you need inspiration for the script, below is an adjusted version of the API sample scripts provided by YouTube.

    Just adjust the credentials file path and the input path of the file produced by the web scrape.

    import os
    
    import google_auth_oauthlib.flow
    import googleapiclient.discovery
    import googleapiclient.errors
    import json
    
    scopes = ["https://www.googleapis.com/auth/youtube.readonly"]
    
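    def do_request(youtube, video_ids):
        # Fetch details for one batch of video IDs in a single call.
        # https://developers.google.com/youtube/v3/docs/videos/list
        # maxResults is omitted: per the docs it is not supported together with id.
        request = youtube.videos().list(
            part='contentDetails,id,snippet,statistics',
            id=','.join(video_ids)
        )
    
        return request.execute()["items"]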
    
    def main(video_ids):
        # Disable OAuthlib's HTTPS verification when running locally.
        # *DO NOT* leave this option enabled in production.
        os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
    
        api_service_name = "youtube"
        api_version = "v3"
        client_secrets_file = "INPUTAPICREDFILEHERE./creds.json"
    
        # Get credentials and create an API client
        flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
            client_secrets_file, scopes)
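        # run_console() relied on the deprecated OOB flow and is gone from recent
        # google-auth-oauthlib releases; run_local_server() opens a browser instead.
        credentials = flow.run_local_server(port=0)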
        youtube = googleapiclient.discovery.build(
            api_service_name, api_version, credentials=credentials)
    
        data = { 'items': [] }
        current_id_batch = []
        for video_id in video_ids:  # renamed to avoid shadowing the built-in id()
            if len(current_id_batch) == 50:
                print(f"Fetching.. {len(data['items'])} of {len(video_ids)} done so far")
                result = do_request(youtube, current_id_batch)
                data['items'].extend(result)
                current_id_batch = []
            current_id_batch.append(video_id)
        
        # Flush the final partial batch; skip the call if there were no IDs at all.
        if current_id_batch:
            result = do_request(youtube, current_id_batch)
            data['items'].extend(result)
        
        with open('./data.json', 'w') as outfile:
            json.dump(data, outfile, indent=4)
    
    if __name__ == "__main__":
        # The scraped file is read as a JSON object whose keys are the video IDs.
        with open('PATHTOLIKEDVIDEOS/liked_videos.json', encoding="utf8") as f:
            liked_vids = json.load(f)
        main(list(liked_vids.keys()))
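The batching step in the script above can also be isolated into a small helper, which makes it easier to test without touching the API. This is just a sketch: chunked is a hypothetical helper name, and the IDs are made-up placeholders rather than real video IDs.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs standing in for the keys of the scraped JSON file.
video_ids = [f"id{n}" for n in range(120)]

batches = list(chunked(video_ids))
# Each batch becomes one comma-separated value for the id parameter.
id_params = [",".join(batch) for batch in batches]
print([len(batch) for batch in batches])
```

Each entry of id_params could then be passed as the id argument of one videos().list() call. An empty input yields no batches, so no request would be made for it.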