Search code examples
pythonpandascrosstab

Query on crosstab using IMDB dataset


I'm using the IMDB dataset, and in the end, the headers are mentioned.

Question: 1. Report that captures the trend of the number of letters in movie titles over the years. 2. cross tab between the year of the video release and the quantile that length falls under. The results should contain year, min_length, max_length, num_videos_less_than25Percentile, num_videos_25_50Percentile , num_videos_50_75Percentile, num_videos_greaterthan75Precentile

1st part can be solved easily. For 2nd part, being a beginner in cross tab, I know the syntax but could someone guide how to proceed with the solution.

Please let me know if any more information is required.

    imdb.columns
Index(['fn', 'tid', 'title', 'wordsInTitle', 'url', 'imdbRating',
       'ratingCount', 'duration', 'year', 'type', 'nrOfWins',
       'nrOfNominations', 'nrOfPhotos', 'nrOfNewsArticles', 'nrOfUserReviews',
       'nrOfGenre', 'Action', 'Adult', 'Adventure', 'Animation', 'Biography',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy',
       'FilmNoir', 'GameShow', 'History', 'Horror', 'Music', 'Musical',
       'Mystery', 'News', 'RealityTV', 'Romance', 'SciFi', 'Short', 'Sport',
       'TalkShow', 'Thriller', 'War', 'Western'],
      dtype='object')

Solution

  • It can be solved using groupby:

    import pandas as pd
    import numpy as np
    df = pd.read_csv('https://raw.githubusercontent.com/mandarkadam/Data-Mining-IMDB-score/master/imdb-movies-dataset%20-%20test.csv')
    
    df.groupby('year').duration.agg(
        min_length='min',
        max_length='max',
        num_videos_less_than25Percentile=lambda x: np.count_nonzero((x<x.quantile(.25)).values),
        num_videos_25_50Percentile=lambda x: np.count_nonzero(((x>=x.quantile(.25))&(x<x.quantile(.5))).values),
        num_videos_50_75Percentile=lambda x: np.count_nonzero(((x>=x.quantile(.5))&(x<x.quantile(.75))).values),
        num_videos_greaterthan75Precentile=lambda x: np.count_nonzero((x>=x.quantile(.75)).values)
    ).astype('Int64')
    

    Result:

            min_length  max_length  num_videos_less_than25Percentile  num_videos_25_50Percentile  num_videos_50_75Percentile  num_videos_greaterthan75Precentile
    year                                                                                                                                                        
    1888.0           2           2                                 0                           0                           0                                   1
    1894.0          22          22                                 0                           0                           0                                   1
    1904.0        <NA>        <NA>                                 0                           0                           0                                   0
    1910.0         660         660                                 0                           0                           0                                   1
    1911.0        1080        1080                                 0                           0                           0                                   1
    ...            ...         ...                               ...                         ...                         ...                                 ...
    2009.0        3000        7080                                 4                           2                           5                                   4
    2010.0        3120        6840                                 1                           1                           1                                   2
    2011.0        5580        7620                                 2                           1                           1                                   2
    2012.0        1800        6000                                 1                           0                           1                                   1
    2014.0        1800        1800                                 0                           0                           0                                   1
    
    [94 rows x 6 columns]