Creating a new DataFrame out of 2 existing Dataframes with Values coming from Dataframe 1?

I have 2 DataFrames.

DF1:

movieId title   genres
0   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy
1   2   Jumanji (1995)  Adventure|Children|Fantasy
2   3   Grumpier Old Men (1995) Comedy|Romance
3   4   Waiting to Exhale (1995)    Comedy|Drama|Romance
4   5   Father of the Bride Part II (1995)  Comedy

DF2:

userId  movieId rating  timestamp
0   1   1   4.0 964982703
1   1   3   4.0 964981247
2   1   6   4.0 964982224
3   1   47  5.0 964983815
4   1   50  5.0 964982931

My new DataFrame should look like this.

DF_new:

userID Toy Story Jumanji Grumpier Old Men Waiting to Exhale Father of the Pride Part II
1       4.0
2
3
4

The Values will be the ratings of the the indiviudel user to each movie. The movie titles are now the columns. The userId are now the rows.

I think it will work over concatinating via the movieid. But im not sure how to do this exactly, so that i still have the movie names attached to the movieid.

Anybody has an idea?

Solution

The problem consists of essentially 2 parts:

How to transpose df2, the sole table where user ratings comes from, to the desired format. pd.DataFrame.pivot_table is the standard way to go.
The rest is about mapping the movieIDs to their names. This can be easily done by direct substitution on df.columns.

In addition, if movies receiving no ratings were to be listed as well, just insert the missing movieIDs directly before name substitution mentioned previously.

Code

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    data={
        "movieId": [1,2,3,4,5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],

    }
)
# to better demonstrate the correctness, 2 distinct user ids were used.
df2 = pd.DataFrame(
    data={
        "userId": [1,1,1,2,2],
        "movieId": [1,2,2,3,5],
        "rating": [4,5,4,5,4]
    }
)

# 1. Produce the main table
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")

print(df_new)  # already pretty close
Out[17]: 
movieId    1    2    3    5
userId                     
1        4.0  4.5  NaN  NaN
2        NaN  NaN  5.0  4.0


# 2. map movie ID's to titles

# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")

# strip the last 7 characters containing year
# (assume consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])

# (optional) fill unrated columns and sort
for movie_id in df_names.index.values:
    if movie_id not in df_new.columns.values:
        df_new[movie_id] = np.nan
else:
    df_new = df_new[df_names.index.values]

# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values

Result

df_new
Out[16]: 
        toy story  Jumanji  Grumpier 33  Waiting 44  Father 55
userId                                                        
1             4.0      4.5          NaN         NaN        NaN
2             NaN      NaN          5.0         NaN        4.0