I'm using the IMDB dataset, and in the end, the headers are mentioned.
Question: 1. Report that captures the trend of the number of letters in movie titles over the years. 2. cross tab between the year of the video release and the quantile that length falls under. The results should contain year, min_length, max_length, num_videos_less_than25Percentile, num_videos_25_50Percentile , num_videos_50_75Percentile, num_videos_greaterthan75Precentile
1st part can be solved easily. For 2nd part, being a beginner in cross tab, I know the syntax but could someone guide how to proceed with the solution.
Please let me know if any more information is required.
imdb.columns
Index(['fn', 'tid', 'title', 'wordsInTitle', 'url', 'imdbRating',
'ratingCount', 'duration', 'year', 'type', 'nrOfWins',
'nrOfNominations', 'nrOfPhotos', 'nrOfNewsArticles', 'nrOfUserReviews',
'nrOfGenre', 'Action', 'Adult', 'Adventure', 'Animation', 'Biography',
'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy',
'FilmNoir', 'GameShow', 'History', 'Horror', 'Music', 'Musical',
'Mystery', 'News', 'RealityTV', 'Romance', 'SciFi', 'Short', 'Sport',
'TalkShow', 'Thriller', 'War', 'Western'],
dtype='object')
It can be solved using groupby
:
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/mandarkadam/Data-Mining-IMDB-score/master/imdb-movies-dataset%20-%20test.csv')
df.groupby('year').duration.agg(
min_length='min',
max_length='max',
num_videos_less_than25Percentile=lambda x: np.count_nonzero((x<x.quantile(.25)).values),
num_videos_25_50Percentile=lambda x: np.count_nonzero(((x>=x.quantile(.25))&(x<x.quantile(.5))).values),
num_videos_50_75Percentile=lambda x: np.count_nonzero(((x>=x.quantile(.5))&(x<x.quantile(.75))).values),
num_videos_greaterthan75Precentile=lambda x: np.count_nonzero((x>=x.quantile(.75)).values)
).astype('Int64')
Result:
min_length max_length num_videos_less_than25Percentile num_videos_25_50Percentile num_videos_50_75Percentile num_videos_greaterthan75Precentile
year
1888.0 2 2 0 0 0 1
1894.0 22 22 0 0 0 1
1904.0 <NA> <NA> 0 0 0 0
1910.0 660 660 0 0 0 1
1911.0 1080 1080 0 0 0 1
... ... ... ... ... ... ...
2009.0 3000 7080 4 2 5 4
2010.0 3120 6840 1 1 1 2
2011.0 5580 7620 2 1 1 2
2012.0 1800 6000 1 0 1 1
2014.0 1800 1800 0 0 0 1
[94 rows x 6 columns]