python loops youtube-api youtube-data-api

YouTube comments extractor loops infinitely when there are too many comments

I wrote a script that extracts a YouTube video's comments and stores them in a file, given the id of the video. If the video has fewer than 10-15 comments, the script works fine, but when there are more it goes into an infinite loop and I can't figure out why.

from googleapiclient.discovery import build 
import os
api_key = '...'

def video_comments(video_id): 
    # empty file for storing comments
    outputFile = open("comments_"+video_id+".txt", "w", encoding='utf-8')

    # empty list to store the data (one dict per comment)
    commentsDict = []

    # empty list for storing replies
    replies = [] 

    # creating youtube resource object 
    youtube = build('youtube', 'v3', 
                    developerKey=api_key) 

    # retrieve youtube video results 
    video_response=youtube.commentThreads().list( 
    part='snippet,replies', 
    videoId=video_id 
    ).execute() 

    # iterate video response 
    while video_response: 
        
        # extracting required info 
        # from each result object 
        for item in video_response['items']: 
            # Extracting comments 
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay'] 
            commentEntrie = {"comment": comment, 'replies': []}
            
            # number of replies to this comment
            replycount = item['snippet']['totalReplyCount']

            # if there are replies
            if replycount > 0:

                # iterate through all replies
                for reply in item['replies']['comments']: 
                    
                    # Extract reply 
                    reply = reply['snippet']['textDisplay'] 
                    
                    # Store reply in list
                    replies.append(reply) 
                    commentEntrie['replies'].append(reply)
                    
            # print comment with list of reply 
            print(comment, replies, end = '\n\n')
            outputFile.write("%s" % comment)
            outputFile.write("%s\n" % replies)
            commentsDict.append(commentEntrie)
            # empty reply list 
            replies = [] 

        # Again repeat 
        if 'nextPageToken' in video_response: 
            video_response = youtube.commentThreads().list( 
                    part = 'snippet,replies', 
                    videoId = video_id 
                ).execute() 
        else: 
            break
    outputFile.close()
    print(commentsDict)

# Enter video id 
video_id = "aDHYbM9OqUc" 

# Call function 
video_comments(video_id)

Here are two video ids: LVgKlfw4DHc works fine, but aDHYbM9OqUc ends in an infinite loop. Any ideas?

[EDIT] It seems that nextPageToken is always present in the response, so the while loop never ends.

Solution

Your loop while video_response: runs forever because of this piece of code:

if 'nextPageToken' in video_response: 
    video_response = youtube.commentThreads().list( 
        part = 'snippet,replies', 
        videoId = video_id 
    ).execute() 
else: 
    break

If the first video_response contains the property nextPageToken, then the call to CommentThreads.list inside the loop is exactly the same as the one outside the loop: neither passes a pageToken. The second call therefore returns the same first page as the previous call. That page again contains nextPageToken, so the loop never reaches the break.
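To see the effect without calling the real API, here is a minimal simulation. The names PAGES and fetch_page are hypothetical stand-ins for a paginated endpoint, not part of the YouTube client library; the only assumption is that a request without a page token always returns the first page.

```python
# Hypothetical in-memory "API": maps a page token to (items, next token).
PAGES = {
    None: (['comment 1', 'comment 2'], 'tok2'),
    'tok2': (['comment 3'], None),
}

def fetch_page(page_token=None):
    """Return one page; without a token this is always the first page."""
    items, next_token = PAGES[page_token]
    response = {'items': items}
    if next_token:
        response['nextPageToken'] = next_token
    return response

first = fetch_page()
# Same arguments as the first call, like the request inside the question's loop.
again = fetch_page()
# Passing the token is what actually advances to the next page.
advanced = fetch_page(first['nextPageToken'])

print(first == again)                 # the repeated call yields the same page
print('nextPageToken' in advanced)    # the last page carries no token
```

Because the repeated call returns the first page again, the check for nextPageToken succeeds on every iteration.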

The correct implementation would be:

if 'nextPageToken' in video_response: 
    video_response = youtube.commentThreads().list( 
        pageToken = video_response['nextPageToken'],
        part = 'snippet,replies', 
        videoId = video_id 
    ).execute() 
else: 
    break
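As a rough sketch, the same explicit token handling can be wrapped in a generator so the caller only sees a flat stream of items. The function name iter_comment_threads is hypothetical, and youtube is assumed to be the resource object returned by build('youtube', 'v3', developerKey=api_key) as in the question.

```python
def iter_comment_threads(youtube, video_id):
    """Yield every comment thread item of video_id, page by page."""
    page_token = None
    while True:
        params = {'part': 'snippet,replies', 'videoId': video_id}
        # Only send pageToken once a previous response has provided one.
        if page_token:
            params['pageToken'] = page_token
        response = youtube.commentThreads().list(**params).execute()

        yield from response.get('items', [])

        # No nextPageToken means this was the last page.
        page_token = response.get('nextPageToken')
        if not page_token:
            break
```

The loop body of the question could then become for item in iter_comment_threads(youtube, video_id): with the extraction code unchanged.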

Since you're using Google's API Client Library for Python, the Pythonic way to paginate the result set of the CommentThreads.list API endpoint looks like this:

request = youtube.commentThreads().list(
    part = 'snippet,replies', 
    videoId = video_id 
)

while request:
    response = request.execute()

    for item in response['items']:
        ...

    request = youtube.commentThreads().list_next(
        request, response)

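It is this simple because of the way the Python client library is implemented: list_next builds the request for the next page from the previous request and response, and returns None when there are no more pages, which ends the loop. There is no need to explicitly handle the API response's nextPageToken property or the API request's pageToken parameter at all.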
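Putting it together, a possible rewrite of the question's function using list_next might look like the sketch below. The name collect_comments is hypothetical, youtube is assumed to come from build('youtube', 'v3', developerKey=api_key), and the file writing from the question is left out for brevity. The .get calls are a defensive assumption for threads whose item carries no replies key.

```python
def collect_comments(youtube, video_id):
    """Return a list of {'comment': ..., 'replies': [...]} dicts for video_id."""
    comments = []
    request = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id
    )

    # list_next returns None after the last page, which ends the loop.
    while request is not None:
        response = request.execute()

        for item in response['items']:
            top = item['snippet']['topLevelComment']['snippet']['textDisplay']
            # Threads without replies may not include the 'replies' key.
            replies = [
                reply['snippet']['textDisplay']
                for reply in item.get('replies', {}).get('comments', [])
            ]
            comments.append({'comment': top, 'replies': replies})

        request = youtube.commentThreads().list_next(request, response)

    return comments
```

Building a fresh replies list per comment also removes the need to reset a shared list at the end of each iteration.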