Adapting to the disappearance of the IMDb datasets

So the freely available IMDb datasets will disappear at the end of 2017.

From what I understand, you must:

identify yourself (register a personal account for all access)
pay money (once a free quota is used up, though the actual price may be minuscule)
write code (though it looks like you're downloading .gz files, so probably simple)

Some questions arise from this:

What does the data format look like? There's a brief example on the page, but does anyone have an actual file showing how titles, years, votes, etc. are formatted and linked?
What are your options if you don't want to go along with this regime? Are there freely available copies of the datasets somewhere? What other freely available film databases exist that cover at least all movies and TV series of some minimal interest released from 2017 onward?

Solution

Talking about the paywall

The new files amount to about 360 megabytes of data, so from what I understand of the S3 pricing, you will stay well inside the free cap unless you download them many times a month.

What does the data format look like?

They seem to be dumps of database tables.

As an example, here is the beginning of title.basics.tsv.gz:

tconst  titleType       primaryTitle    originalTitle   isAdult startYear       endYear runtimeMinutes  genres
tt0000001       short   Carmencita      Carmencita      0       1894    \N      1       Documentary,Short
tt0000002       short   Le clown et ses chiens  Le clown et ses chiens  0       1892    \N      5       Animation,Short
tt0000003       short   Pauvre Pierrot  Pauvre Pierrot  0       1892    \N      4       Animation,Comedy,Romance
tt0000004       short   Un bon bock     Un bon bock     0       1892    \N      \N      Animation,Short
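
Judging from the sample above, the files are plain tab-separated text with a header row, and \N marks a missing value. A minimal sketch of reading such a file in Python follows. The helper names read_imdb_tsv and open_imdb_gz are made up for illustration, and the UTF-8 encoding is my assumption, not something the sample confirms.

```python
import csv
import gzip
import io

def read_imdb_tsv(fileobj):
    # Yield one dict per data row, keyed by the header names.
    # QUOTE_NONE keeps quote characters inside titles as literal text.
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # The dumps appear to use \N for "no value"; map it to None.
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}

def open_imdb_gz(path):
    # Assumption: the files are UTF-8 encoded text inside the gzip container.
    return gzip.open(path, mode="rt", encoding="utf-8", newline="")

# Two rows copied from the sample above, used instead of a real download.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000004\tshort\tUn bon bock\tUn bon bock\t0\t1892\t\\N\t\\N\tAnimation,Short\n"
)

for title in read_imdb_tsv(io.StringIO(SAMPLE)):
    genres = title["genres"].split(",") if title["genres"] else []
    print(title["tconst"], title["primaryTitle"], title["startYear"], genres)
```

With a real download you would pass open_imdb_gz("title.basics.tsv.gz") to read_imdb_tsv instead of the in-memory sample.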

The available files are: title.basics.tsv.gz, title.crew.tsv.gz, title.episode.tsv.gz, title.principals.tsv.gz, title.ratings.tsv.gz and name.basics.tsv.gz.

These are the fields in each file:

name.basics.tsv.gz
nconst primaryName birthYear deathYear primaryProfession knownForTitles

title.basics.tsv.gz
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres

title.crew.tsv.gz
tconst directors writers

title.episode.tsv.gz
tconst parentTconst seasonNumber episodeNumber

title.principals.tsv.gz
tconst principalCast

title.ratings.tsv.gz
tconst averageRating numVotes
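
Since every title.* file carries a tconst column, the files can be linked on that key. Here is a rough sketch that attaches ratings to titles. The function name join_titles_with_ratings and the min_votes parameter are hypothetical, the rating values in the sample are invented for illustration only, and I'm assuming averageRating is a decimal number and numVotes an integer.

```python
import csv
import io

def read_tsv(fileobj):
    # Tab-separated with a header row; quotes are treated as literal text.
    return csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)

def join_titles_with_ratings(basics_file, ratings_file, min_votes=0):
    # title.ratings is the smallest file, so keep it in memory keyed by tconst.
    ratings = {
        row["tconst"]: (float(row["averageRating"]), int(row["numVotes"]))
        for row in read_tsv(ratings_file)
    }
    # Stream title.basics and keep only titles that have enough votes.
    for title in read_tsv(basics_file):
        rating = ratings.get(title["tconst"])
        if rating is None or rating[1] < min_votes:
            continue
        yield (title["tconst"], title["primaryTitle"], title["startYear"],
               rating[0], rating[1])

# Titles come from the sample above; the ratings are invented placeholders.
BASICS = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000003\tshort\tPauvre Pierrot\tPauvre Pierrot\t0\t1892\t\\N\t4\tAnimation,Comedy,Romance\n"
)
RATINGS = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.5\t1000\n"
    "tt0000003\t6.5\t50\n"
)

for joined in join_titles_with_ratings(io.StringIO(BASICS), io.StringIO(RATINGS), min_votes=100):
    print(joined)
```

Loading the ratings into a dict and streaming the larger file keeps memory use modest; the same pattern should work for title.crew or title.episode, keyed on tconst or parentTconst.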

As for the number of lines in each file, we currently (2017-08-21) have:

name.basics.tsv.gz 8086560
title.basics.tsv.gz 4466246
title.crew.tsv.gz 4466246
title.episode.tsv.gz 2934335
title.principals.tsv.gz 3957899
title.ratings.tsv.gz 757412
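
If you want to check counts like these against your own download, a small sketch along these lines should do. The function name count_lines is just for illustration, and note that it counts every line including the header row.

```python
import gzip

def count_lines(path):
    # Read the gzip stream in binary mode and count newline-terminated lines
    # without decoding the text or loading the whole file into memory.
    with gzip.open(path, "rb") as compressed:
        return sum(1 for _ in compressed)
```

For example, count_lines("title.ratings.tsv.gz") would return the total for the ratings file, assuming it sits in the current directory.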

What are your options if you don't want to go along with this regime?

Not many, I fear. But if the price is the only concern, see above.

All of my findings about the new format are in this thread on the imdbpy-devel mailing list.

What other freely available film databases exist

I think the best alternatives are https://www.themoviedb.org/ and http://www.omdbapi.com/ but I'm not too familiar with either.