So the freely available IMDb datasets will disappear at the end of 2017.
From what I understand, you must:
Some questions arise from this:
Talking about the paywall
The new files amount to about 360 megabytes of data, so from what I understand of the S3 pricing, you will stay well inside the free cap unless you download them many times a month.
What does the data format look like?
They seem to be dumps of database tables.
As an example, here is the beginning of title.basics.tsv.gz:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
tt0000001 short Carmencita Carmencita 0 1894 \N 1 Documentary,Short
tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 \N 5 Animation,Short
tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 \N 4 Animation,Comedy,Romance
tt0000004 short Un bon bock Un bon bock 0 1892 \N \N Animation,Short
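A minimal sketch of how such a file could be read in Python. It assumes the columns are separated by tabs (as the .tsv extension suggests), that the files are UTF-8 encoded, and that \N marks a missing value, as it appears to in the sample above. The helper names read_tsv and read_tsv_gz are mine, not part of any IMDb tooling.

```python
import csv
import gzip
import io

NULL = r"\N"  # assumed marker for a missing value, as seen in the sample rows


def read_tsv(stream):
    """Yield one dict per data row of a tab-separated text stream, mapping \\N to None."""
    # QUOTE_NONE: treat quote characters inside titles as ordinary text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == NULL else value) for key, value in row.items()}


def read_tsv_gz(path):
    """Open a gzip-compressed TSV file (e.g. title.basics.tsv.gz) and yield its rows."""
    # UTF-8 is an assumption about the encoding of the dumps.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as stream:
        yield from read_tsv(stream)


# In-memory demo using the header and one row from the sample above.
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000004\tshort\tUn bon bock\tUn bon bock\t0\t1892\t\\N\t\\N\tAnimation,Short\n"
)
rows = list(read_tsv(sample))
```

Multi-valued columns such as genres come back as a single comma-separated string, so something like rows[0]["genres"].split(",") is still needed to get the individual values.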
The available files are: title.basics.tsv.gz, title.crew.tsv.gz, title.episode.tsv.gz, title.principals.tsv.gz, title.ratings.tsv.gz and name.basics.tsv.gz.
In terms of contained data, these are the fields in each file:
name.basics.tsv.gz: nconst primaryName birthYear deathYear primaryProfession knownForTitles
title.basics.tsv.gz: tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
title.crew.tsv.gz: tconst directors writers
title.episode.tsv.gz: tconst parentTconst seasonNumber episodeNumber
title.principals.tsv.gz: tconst principalCast
title.ratings.tsv.gz: tconst averageRating numVotes
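Since these look like table dumps whose layout may change, it can be handy to check a file's header before processing it. The following is only a sketch: the pairing of files and columns is read off the column names, so treat it as an assumption, and EXPECTED_COLUMNS and check_header are hypothetical names of my own.

```python
# Assumed pairing of each file with its column list (inferred from the column names).
EXPECTED_COLUMNS = {
    "name.basics.tsv.gz": [
        "nconst", "primaryName", "birthYear", "deathYear",
        "primaryProfession", "knownForTitles",
    ],
    "title.basics.tsv.gz": [
        "tconst", "titleType", "primaryTitle", "originalTitle", "isAdult",
        "startYear", "endYear", "runtimeMinutes", "genres",
    ],
    "title.crew.tsv.gz": ["tconst", "directors", "writers"],
    "title.episode.tsv.gz": ["tconst", "parentTconst", "seasonNumber", "episodeNumber"],
    "title.principals.tsv.gz": ["tconst", "principalCast"],
    "title.ratings.tsv.gz": ["tconst", "averageRating", "numVotes"],
}


def check_header(filename, header_line):
    """Return True if a tab-separated header line matches the expected columns."""
    columns = header_line.rstrip("\r\n").split("\t")
    return columns == EXPECTED_COLUMNS[filename]
```

For example, check_header("title.crew.tsv.gz", first_line) would return False if a column were added, renamed or reordered in a later dump.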
As for the number of lines in each file, we currently (2017-08-21) have:
name.basics.tsv.gz 8086560
title.basics.tsv.gz 4466246
title.crew.tsv.gz 4466246
title.episode.tsv.gz 2934335
title.principals.tsv.gz 3957899
title.ratings.tsv.gz 757412
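Counts like these can be reproduced without unpacking the files to disk. A small sketch follows; count_lines and count_lines_gz are hypothetical helper names, and whether you subtract one for the header line is up to you.

```python
import gzip
import io


def count_lines(stream):
    """Count the lines of an already opened (decompressed) binary stream."""
    return sum(1 for _ in stream)


def count_lines_gz(path):
    """Count the lines of a gzip-compressed file, decompressing it on the fly."""
    with gzip.open(path, mode="rb") as stream:
        return count_lines(stream)


# In-memory demo: a gzip-compressed header plus one data row.
demo = io.BytesIO(gzip.compress(b"tconst\tdirectors\twriters\ntt0000001\t\\N\t\\N\n"))
with gzip.open(demo, mode="rb") as stream:
    demo_count = count_lines(stream)
```

On the real files you would call something like count_lines_gz("title.ratings.tsv.gz").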
What are your options if you don't want to go along with this regime?
Not many, I fear. But if the price is the only concern, see above.
All of my findings about the new format are in this thread on the imdbpy-devel mailing list.
What other freely available film databases exist?
I think the best alternatives are and but I'm not too familiar with either.