I have 2 DataFrames.
DF1:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
DF2:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
My new DataFrame should look like this.
DF_new:
userID Toy Story Jumanji Grumpier Old Men Waiting to Exhale Father of the Pride Part II
1 4.0
2
3
4
The Values will be the ratings of the the indiviudel user to each movie. The movie titles are now the columns. The userId are now the rows.
I think it will work over concatinating via the movieid. But im not sure how to do this exactly, so that i still have the movie names attached to the movieid.
Anybody has an idea?
The problem consists of essentially 2 parts:
df2
, the sole table where user ratings comes from, to the desired format. pd.DataFrame.pivot_table is the standard way to go.movieID
s to their names. This can be easily done by direct substitution on df.columns
.In addition, if movies receiving no ratings were to be listed as well, just insert the missing movieID
s directly before name substitution mentioned previously.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
data={
"movieId": [1,2,3,4,5],
"title": ["toy story (1995)",
"Jumanji (1995)",
"Grumpier 33 (1995)", # shortened for printing
"Waiting 44 (1995)",
"Father 55 (1995)"],
}
)
# to better demonstrate the correctness, 2 distinct user ids were used.
df2 = pd.DataFrame(
data={
"userId": [1,1,1,2,2],
"movieId": [1,2,2,3,5],
"rating": [4,5,4,5,4]
}
)
# 1. Produce the main table
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new) # already pretty close
Out[17]:
movieId 1 2 3 5
userId
1 4.0 4.5 NaN NaN
2 NaN NaN 5.0 4.0
# 2. map movie ID's to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing year
# (assume consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
# (optional) fill unrated columns and sort
for movie_id in df_names.index.values:
if movie_id not in df_new.columns.values:
df_new[movie_id] = np.nan
else:
df_new = df_new[df_names.index.values]
# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
df_new
Out[16]:
toy story Jumanji Grumpier 33 Waiting 44 Father 55
userId
1 4.0 4.5 NaN NaN NaN
2 NaN NaN 5.0 NaN 4.0