python, string, analysis

Python: Getting only one string of interest out of a series of similar strings


I am looking into this dataset: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset?select=rotten_tomatoes_movies.csv

I am interested in scores grouped by production companies, but some companies have subdivisions that are very similar to each other, e.g. 20th Century Fox, 20th Century Fox Distribution, 20th Century Fox Film, 20th Century Fox Film Corp., and so on.

I am looking for a way to collect all the movies produced under any subdivision into one category, in this case 20th Century Fox, as I am not interested in the specific division.

I have done some initialization and cleaning of the data, based on a Git repository:

import pandas as pd
import numpy as np

df = pd.read_csv('rotten_tomatoes_movies.csv')

cols_of_interest = ['movie_title', 'genres', 'content_rating', 'original_release_date', 'streaming_release_date', 'runtime', 'tomatometer_rating', 'tomatometer_count', 'audience_rating', 'audience_count', 'production_company']

df = df[cols_of_interest]

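# Assign the result back; an in-place fillna on an attribute-accessed column may not update df
df['original_release_date'] = df['original_release_date'].fillna(df['streaming_release_date'])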
df.drop(columns=['streaming_release_date'], inplace=True)
df.rename(columns={'original_release_date': 'release_date'}, inplace=True)

df = df[(df.tomatometer_count>0) & (df.audience_count>0)]

df.drop_duplicates(subset=['movie_title', 'release_date'], inplace=True)

df.dropna(subset=['genres', 'release_date'], inplace=True)

df = df.sort_values(by='release_date', ascending=True).reset_index(drop=True)

For my specific problem, I had the idea to base the analysis on the first word of the company name, using:

df.production_company.astype('|S')
test = df.production_company.split(' ',1)

which gives

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)

Any ideas on other approaches, or help with the current error, would be greatly appreciated!


Solution

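  • Some production companies are probably French ones, or otherwise have accented names. According to Wikipedia, \xe9 is the Unicode code point for an accented e (é). The error is raised by astype('|S'), which tries to convert the names to ASCII byte strings and cannot encode that character. You can try specifying the encoding explicitly when reading the file, although utf-8 is already the default for read_csv: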

    df = pd.read_csv('rotten_tomatoes_movies.csv', encoding='utf-8')
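
    Independently of the encoding, here is a minimal sketch of the first-word idea from the question that skips the byte-string conversion and uses the pandas .str accessor instead, which works on Unicode strings directly. The sample Series stands in for df.production_company: the four 20th Century Fox variants are taken from the question, and the remaining entries are made up for illustration. The names known_prefixes, normalize_company and company_key are hypothetical, and the prefix list is an assumption that would have to be curated by hand for the real data.

```python
import pandas as pd

# Hypothetical sample standing in for df.production_company.
# The 20th Century Fox variants come from the question; the rest are made up.
companies = pd.Series([
    "20th Century Fox",
    "20th Century Fox Distribution",
    "20th Century Fox Film",
    "20th Century Fox Film Corp.",
    "Pathé",
    "Pathé Distribution",
    None,
])

# First word only, as attempted in the question.
# The .str accessor copes with accented characters and missing values.
first_word = companies.str.split(" ", n=1).str[0]

# Alternative: map every name that starts with a known prefix to that prefix.
# This list is an assumption and needs to be maintained by hand.
known_prefixes = ["20th Century Fox", "Pathé"]


def normalize_company(name):
    """Return the matching prefix, or the name unchanged if none matches."""
    if not isinstance(name, str):
        return name  # leave missing values as they are
    for prefix in known_prefixes:
        if name.startswith(prefix):
            return prefix
    return name


company_key = companies.map(normalize_company)
```

    On the real data, something like df['company_key'] = df.production_company.map(normalize_company) followed by df.groupby('company_key')['tomatometer_rating'].mean() would then aggregate all subdivisions together. Grouping on the first word alone is cruder, since it also merges unrelated companies that happen to share a first word.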