I wrote a script that extracts a YouTube video's comments and stores them in a file, given the video's ID. If the video has fewer than 10-15 comments, the script works fine, but when there are more, it goes into an infinite loop and I can't figure out why.
    from googleapiclient.discovery import build
    import os

    api_key = '...'

    def video_comments(video_id):
        # empty file for storing comments
        outputFile = open("comments_"+video_id+".txt", "w", encoding='utf-8')

        # empty list to store the data
        commentsDict = []

        # empty list for storing replies
        replies = []

        # creating youtube resource object
        youtube = build('youtube', 'v3',
                        developerKey=api_key)

        # retrieve youtube video results
        video_response=youtube.commentThreads().list(
            part='snippet,replies',
            videoId=video_id
        ).execute()

        # iterate video response
        while video_response:

            # extracting required info
            # from each result object
            for item in video_response['items']:

                # Extracting comments
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                commentEntrie = {"comment": comment, 'replies': []}

                # number of replies to the comment
                replycount = item['snippet']['totalReplyCount']

                # if there are replies
                if replycount>0:

                    # iterate through all replies
                    for reply in item['replies']['comments']:

                        # Extract reply
                        reply = reply['snippet']['textDisplay']

                        # Store reply in list
                        replies.append(reply)
                        commentEntrie['replies'].append(reply)

                # print comment with list of replies
                print(comment, replies, end = '\n\n')
                outputFile.write("%s" % comment)
                outputFile.write("%s\n" % replies)
                commentsDict.append(commentEntrie)

                # empty reply list
                replies = []

            # Again repeat
            if 'nextPageToken' in video_response:
                video_response = youtube.commentThreads().list(
                    part = 'snippet,replies',
                    videoId = video_id
                ).execute()
            else:
                break

        outputFile.close()
        print(commentsDict)

    # Enter video id
    video_id = "aDHYbM9OqUc"

    # Call function
    video_comments(video_id)
Here are two video IDs: LVgKlfw4DHc works fine, but aDHYbM9OqUc ends in an infinite loop.
Any ideas?
[EDIT] It looks like nextPageToken is always present in the response, so the while loop never ends.
Your loop while video_response: runs forever because of this piece of code:
    if 'nextPageToken' in video_response:
        video_response = youtube.commentThreads().list(
            part = 'snippet,replies',
            videoId = video_id
        ).execute()
    else:
        break
If the first video_response contains the property nextPageToken, then the call to CommentThreads.list inside the loop is exactly the same as the one outside the loop. Thus, the second call returns exactly the same video_response as the previous call. That response again contains nextPageToken, so the loop keeps fetching the first page and never reaches the break.
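To see the effect in isolation, here is a minimal sketch that uses a hypothetical in-memory stand-in for a paginated endpoint (PAGES and fake_list are made up for illustration and are not part of the YouTube API). Repeating a request with the same arguments returns the same page; only passing the token moves forward.

```python
# Hypothetical stand-in for a paginated endpoint (NOT the real YouTube API).
# Each page is keyed by the token that selects it; None selects the first page.
PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "t2"},
    "t2": {"items": ["c3", "c4"], "nextPageToken": "t3"},
    "t3": {"items": ["c5"]},  # last page: no nextPageToken
}


def fake_list(page_token=None):
    """Return the page selected by page_token (the first page when omitted)."""
    return PAGES[page_token]


first = fake_list()
second = fake_list()  # same arguments as the first call
same_page_again = second == first  # the "next" page is the first page again

# Passing the token from the previous response is what advances the paging.
advanced = fake_list(page_token=first["nextPageToken"])
```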
The correct implementation would be:
    if 'nextPageToken' in video_response:
        video_response = youtube.commentThreads().list(
            pageToken = video_response['nextPageToken'],
            part = 'snippet,replies',
            videoId = video_id
        ).execute()
    else:
        break
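The same token-passing idea can be written as a small reusable loop. The following is only a sketch: collect_items, fetch_page and the stub pages are hypothetical names introduced for illustration, and the stub replaces the real API call so the loop can be run without an API key.

```python
def collect_items(fetch_page):
    """Gather the items of every page.

    fetch_page is a hypothetical callable that takes a page token
    (None for the first page) and returns a response dict.
    """
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        # Carry the token forward; stop when the response has none.
        token = response.get("nextPageToken")
        if not token:
            break
    return items


# Stub pages standing in for real API responses (illustration only).
_STUB_PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "t2"},
    "t2": {"items": ["c3", "c4"], "nextPageToken": "t3"},
    "t3": {"items": ["c5"]},
}


def stub_fetch(token):
    return _STUB_PAGES[token]


all_items = collect_items(stub_fetch)
```

With the real client, fetch_page would presumably wrap the call to youtube.commentThreads().list(...).execute(), adding the pageToken argument only when the token is not None.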
Since you're using Google's APIs Client Library for Python, the pythonic way of implementing result set pagination on the CommentThreads.list API endpoint looks like this:
    request = youtube.commentThreads().list(
        part = 'snippet,replies',
        videoId = video_id
    )

    while request:
        response = request.execute()

        for item in response['items']:
            ...

        request = youtube.commentThreads().list_next(
            request, response)
It is this simple because of the way the Python client library is implemented: there's no need to explicitly handle the API response object's property nextPageToken and the API request parameter pageToken at all.
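The loop ends because list_next returns None once there are no more pages, which makes the while condition false. The sketch below imitates that contract with hypothetical stand-ins (FakeRequest and this local list_next are illustrations, not the library's source code), so the control flow can be run offline.

```python
class FakeRequest:
    """Hypothetical stand-in for a request object; execute() returns one page."""

    def __init__(self, pages, token=None):
        self.pages = pages
        self.token = token

    def execute(self):
        return self.pages[self.token]


def list_next(previous_request, previous_response):
    """Imitate the contract: a request for the next page, or None when done."""
    token = previous_response.get("nextPageToken")
    if token is None:
        return None
    return FakeRequest(previous_request.pages, token)


# Stub pages standing in for real API responses (illustration only).
pages = {
    None: {"items": ["c1", "c2"], "nextPageToken": "t2"},
    "t2": {"items": ["c3"]},  # last page: no nextPageToken
}

seen = []
request = FakeRequest(pages)
while request:  # becomes None after the last page, ending the loop
    response = request.execute()
    seen.extend(response["items"])
    request = list_next(request, response)
```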