I need help with my dissertation project. I am working on a Python project as part of my Master's degree at a university in England, UK. I got the dataset from Kaggle; it contains over one million movies with their titles, budgets, box-office revenue, genres, popularity, reviews, keywords, etc. Here is the link for reference (I am using the latest update):
https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
UPDATE: The data cleaning is complete, but I need help with data visualizations for my Exploratory Data Analysis (EDA). Because the dataset is large (504 MB in memory), the plots I have produced from it so far look ridiculous. I am very familiar with the matplotlib and seaborn libraries for Python.
The simple code I used for a count plot of part of this data was:
import matplotlib.pyplot as plt
import seaborn as sns

# df2 is the cleaned dataframe
plt.figure(figsize=(20,16))
sns.countplot(x='Genres', data=df2)
plt.xlabel('Genres', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.show()
Does anyone know how I can create clear and concise data visualizations for specific parts of the data, covering univariate, bivariate and multivariate analysis?
For example:
Most popular movie genres/movie production companies in terms of frequency
Top movies by box-office revenue, budget, popularity, etc.
Countries with the largest number of movies produced/distributed.
And much more. Anything would help. Thank you very much in advance.
If df is the dataframe you got from pd.read_csv(), you can filter on the strings inside the column 'production_countries' by using str.contains().
Note that str.contains() treats the pattern as a regular expression, so "|" means "OR". na=False makes rows with missing data in this column count as non-matches, and case=False makes the match case-insensitive, although that does not seem to be necessary for this dataset.
df = df[df['production_countries'].str.contains("United States|United Kingdom", na=False, case=False)]
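To connect the filter above to the "most frequent genres" question, here is a minimal sketch on a toy dataframe. It assumes the genre column is named genres (your df2 uses 'Genres') and holds comma-separated strings such as "Action, Drama"; check both against your cleaned data. The helper names filter_by_country, top_values and plot_top are made up for illustration. Counting individual genres and keeping only the top N is one way to avoid a count plot with one bar for every distinct genre combination.

```python
import re

import pandas as pd


def filter_by_country(df, countries, column="production_countries"):
    """Keep rows whose `column` string mentions any of the given countries."""
    # Escape each name so it is matched literally, then join with "|"
    # so the regex means "this country OR that one".
    pattern = "|".join(re.escape(c) for c in countries)
    return df[df[column].str.contains(pattern, na=False, case=False)]


def top_values(df, column, n=10, sep=","):
    """Count individual entries of a delimited string column; return the top n."""
    values = (
        df[column]
        .dropna()
        .str.split(sep)   # "Action, Drama" -> ["Action", " Drama"]
        .explode()        # one row per entry
        .str.strip()      # drop the space left after the separator
    )
    values = values[values != ""]
    return values.value_counts().head(n)


def plot_top(counts, title):
    """Draw a horizontal bar chart with the largest bar on top."""
    import matplotlib.pyplot as plt

    ax = counts.sort_values().plot.barh(figsize=(8, 5))
    ax.set_xlabel("Frequency")
    ax.set_title(title)
    plt.tight_layout()
    plt.show()


# Toy rows standing in for the real file (made-up values, assumed layout).
toy = pd.DataFrame({
    "genres": ["Action, Drama", "Drama", None, "Comedy, Drama", "Action"],
    "production_countries": [
        "United States of America",
        "United Kingdom, France",
        "Japan",
        None,
        "united states of america",
    ],
})

subset = filter_by_country(toy, ["United States", "United Kingdom"])
top_genres = top_values(subset, "genres", n=2)
print(top_genres)
```

Calling plot_top(top_genres, "Most frequent genres") then draws the chart. The same top_values call should work for production companies or countries if those columns use the same separator, which is also an assumption to verify.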