
Adapting to the disappearance of the IMDb datasets


So the freely available IMDb datasets will disappear at the end of 2017.

From what I understand, to access the new datasets you must:

  • identify yourself (register a personal account for all access)
  • pay money (once a free quota is used up, though the actual price may be minuscule)
  • write code (though it looks like you're downloading .gz files, so probably simple)

Some questions arise from this:

  1. What does the data format look like? There's a brief example on the page, but does anyone have an actual file showing how titles, years, votes, etc. are formatted and linked?
  2. What are your options if you don't want to go along with this regime? Are there freely available copies of the datasets somewhere? What other freely available film databases exist that at least cover all movies and TV series of some minimum interest released from 2017 onward?

Solution

  • About the paywall

    The new files amount to about 360 megabytes of data, so from what I understand of the S3 pricing, you will stay well inside the free cap unless you download them many times a month.

    What does the data format look like?

    They seem to be dumps of database tables.

    As an example, here is the beginning of title.basics.tsv.gz (fields are tab-separated and \N marks a missing value):

    tconst  titleType       primaryTitle    originalTitle   isAdult startYear       endYear runtimeMinutes  genres
    tt0000001       short   Carmencita      Carmencita      0       1894    \N      1       Documentary,Short
    tt0000002       short   Le clown et ses chiens  Le clown et ses chiens  0       1892    \N      5       Animation,Short
    tt0000003       short   Pauvre Pierrot  Pauvre Pierrot  0       1892    \N      4       Animation,Comedy,Romance
    tt0000004       short   Un bon bock     Un bon bock     0       1892    \N      \N      Animation,Short
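
    To make the format concrete, here is a minimal Python sketch for reading one of these files. It assumes the files are UTF-8 text, tab-separated with a header row, unquoted, and that \N marks a missing value, as the sample above suggests. The function name read_imdb_tsv is made up for this illustration.

```python
import csv
import gzip


def read_imdb_tsv(path):
    # Yield one dict per data row, keyed by the column names in the header.
    # Assumption: UTF-8 encoding (not stated in the sample above).
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat double quotes inside titles as ordinary
        # characters instead of CSV quoting (assumed, not confirmed).
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Map the \N placeholder to None so missing values are explicit.
            yield {key: (None if value == r"\N" else value)
                   for key, value in row.items()}
```

    Iterating over read_imdb_tsv("title.basics.tsv.gz") would then give dicts with keys such as tconst, primaryTitle and startYear. All values stay strings, so convert startYear or runtimeMinutes yourself where needed.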
    

    The available files are: title.basics.tsv.gz, title.crew.tsv.gz, title.episode.tsv.gz, title.principals.tsv.gz, title.ratings.tsv.gz and name.basics.tsv.gz.

    These are the fields contained in each file:

    name.basics.tsv.gz
    nconst primaryName birthYear deathYear primaryProfession knownForTitles
    
    title.basics.tsv.gz
    tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
    
    title.crew.tsv.gz
    tconst directors writers
    
    title.episode.tsv.gz
    tconst parentTconst seasonNumber episodeNumber
    
    title.principals.tsv.gz
    tconst principalCast
    
    title.ratings.tsv.gz
    tconst averageRating numVotes
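
    As for how the files are linked: every title.* file carries a tconst column, so rows can be matched on it. The sketch below joins already-parsed rows from title.basics and title.ratings. The function name join_ratings is made up for this illustration, the rating values in the example are placeholders rather than real IMDb data, and it assumes multi-valued fields such as genres are comma-separated, as in the sample above.

```python
def join_ratings(basics_rows, ratings_rows):
    # Index the ratings by tconst so each title lookup is a dict access.
    ratings_by_id = {row["tconst"]: row for row in ratings_rows}
    for title in basics_rows:
        rating = ratings_by_id.get(title["tconst"])
        if rating is None:
            continue  # not every title has a row in title.ratings
        yield {
            "tconst": title["tconst"],
            "primaryTitle": title["primaryTitle"],
            "startYear": title["startYear"],
            # Assumption: genres is a comma-separated list.
            "genres": title["genres"].split(",") if title["genres"] else [],
            "averageRating": float(rating["averageRating"]),
            "numVotes": int(rating["numVotes"]),
        }


# Title fields are taken from the title.basics sample; the rating values
# are made-up placeholders, NOT real IMDb data.
basics = [
    {"tconst": "tt0000001", "primaryTitle": "Carmencita",
     "startYear": "1894", "genres": "Documentary,Short"},
    {"tconst": "tt0000004", "primaryTitle": "Un bon bock",
     "startYear": "1892", "genres": "Animation,Short"},
]
ratings = [{"tconst": "tt0000001", "averageRating": "5.0", "numVotes": "100"}]

joined = list(join_ratings(basics, ratings))
```

    The same pattern should work for title.crew and title.episode (keyed on tconst, with parentTconst pointing at the series), while names are resolved through the nconst column of name.basics.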
    

    As for the number of lines in each file, we currently (2017-08-21) have:

    name.basics.tsv.gz 8086560
    title.basics.tsv.gz 4466246
    title.crew.tsv.gz 4466246
    title.episode.tsv.gz 2934335
    title.principals.tsv.gz 3957899
    title.ratings.tsv.gz 757412
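
    Counts like these can be reproduced without unpacking the files to disk. A minimal Python sketch follows, with count_lines as a made-up helper name. Whether the figures above include the header line is not stated, so treat a difference of one as expected.

```python
import gzip


def count_lines(path):
    # Stream the compressed file and count newline-terminated lines,
    # header included, without holding the whole file in memory.
    with gzip.open(path, mode="rb") as handle:
        return sum(1 for _ in handle)
```

    For example, count_lines("title.ratings.tsv.gz") returns the total number of lines in that file, including the header row.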
    

    What are your options if you don't want to go along with this regime?

    Not many, I fear. But if the price is the only concern, see above.

    All of my findings about the new format are in this thread on the imdbpy-devel mailing list.

    What other freely available film databases exist

    I think the best alternatives are https://www.themoviedb.org/ and http://www.omdbapi.com/ but I'm not very familiar with either.