I'm using IMDbPY in conjunction with the publicly available IMDb datasets (https://www.imdb.com/interfaces/) to create a custom dataset with pandas. The public datasets contain a lot of great info, but as far as I can see they don't contain plot info. IMDbPY does provide plot summaries, as well as plot synopses and plot keywords, through the plot, synopsis, and keywords keys of the movie class/dictionary.
I can get the plot for an individual movie by making an API call: ia.get_movie(movie_index[2:])['plot'][0]
where I use [2:] because the first 2 characters of the index are 'tt' in the public dataset, and [0] because IMDbPY returns several plot summaries and I am taking the first one.
However, to get 10,000 plot summaries, I would need to make 10,000 API calls, which would take me 7.5 hours, assuming each API call takes 2.7 seconds (which is what I measured using tqdm). One solution is to let it run overnight. Are there any other solutions? Also, is there a better way of doing this than my current approach of creating a dictionary with the movie index as the keys (e.g. tt0111161 for "Shawshank Redemption") and the plots as the values, and then converting that dictionary to a dataframe? Any insight is appreciated. My code is below:
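import pandas as pd
from imdb import IMDb
from tqdm import tqdm

ia = IMDb()

movie_dict = {}
for movie_index in tqdm(movies_index[0:10]):
    try:
        # Drop the leading 'tt' and keep the first plot summary.
        movie_dict[movie_index] = ia.get_movie(movie_index[2:])['plot'][0]
    except Exception:
        # No plot available (or the lookup failed): store an empty string.
        movie_dict[movie_index] = ''
plots = pd.DataFrame.from_dict(movie_dict, orient='index')
plots.rename(columns={0: 'plot'}, inplace=True)
plots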
plot
tt0111161 Two imprisoned men bond over a number of years...
tt0468569 When the menace known as the Joker emerges fro...
tt1375666 A thief who steals corporate secrets through t...
tt0137523 An insomniac office worker and a devil-may-car...
tt0110912 The lives of two mob hitmen, a boxer, a gangst...
tt0109830 The presidencies of Kennedy and Johnson, the e...
tt0120737 A meek Hobbit from the Shire and eight compani...
tt0133093 A computer hacker learns from mysterious rebel...
tt0167260 Gandalf and Aragorn lead the World of Men agai...
tt0068646 The aging patriarch of an organized crime dyna...
First of all, consider that doing so many queries in so little time may be considered against their terms of service: https://www.imdb.com/conditions
However, 10,000 queries to a major web site are not enough to create any real problem, especially if you wait a few seconds between calls to be nicer (it will take longer, but that should not be a big deal in your case; again, see above regarding the license, which you must respect).
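As a rough illustration of waiting a few seconds between calls, here is a minimal sketch. The names fetch_plots, fetch_plot, delay_seconds and sleep are made up for this example and are not part of IMDbPY; fetch_plot stands for whatever callable performs the actual lookup, and the choice of which exceptions mean "no plot" is an assumption.

```python
import time


def fetch_plots(movie_ids, fetch_plot, delay_seconds=2.0, sleep=time.sleep):
    """Fetch one plot per ID, pausing between consecutive calls.

    fetch_plot is a hypothetical hook: any callable that takes an ID such
    as 'tt0111161' and returns a plot string. It is not an IMDbPY function.
    """
    plots = {}
    for position, movie_id in enumerate(movie_ids):
        if position > 0:
            # Be polite to the server: wait before every call except the first.
            sleep(delay_seconds)
        try:
            plots[movie_id] = fetch_plot(movie_id)
        except (KeyError, IndexError):
            # Assumed to mean "no plot available": keep an empty string.
            plots[movie_id] = ""
    return plots
```

With the code from the question, the hook could be something like lambda movie_index: ia.get_movie(movie_index[2:])['plot'][0], and the returned dictionary can be passed to pd.DataFrame.from_dict as before. Passing the sleep function as a parameter is only there so the loop can be tried out without actually waiting.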
I can suggest two different options:
Disclaimer: I'm one of the main authors of IMDbPY