Search code examples
pandasdataframeconcatenation

Creating a new DataFrame out of 2 existing Dataframes with Values coming from Dataframe 1?


I have 2 DataFrames.

DF1:

movieId title   genres
0   1   Toy Story (1995)    Adventure|Animation|Children|Comedy|Fantasy
1   2   Jumanji (1995)  Adventure|Children|Fantasy
2   3   Grumpier Old Men (1995) Comedy|Romance
3   4   Waiting to Exhale (1995)    Comedy|Drama|Romance
4   5   Father of the Bride Part II (1995)  Comedy

DF2:

userId  movieId rating  timestamp
0   1   1   4.0 964982703
1   1   3   4.0 964981247
2   1   6   4.0 964982224
3   1   47  5.0 964983815
4   1   50  5.0 964982931

My new DataFrame should look like this.

DF_new:

userID Toy Story Jumanji Grumpier Old Men Waiting to Exhale Father of the Pride Part II
1       4.0
2
3
4

The Values will be the ratings of the the indiviudel user to each movie. The movie titles are now the columns. The userId are now the rows.

I think it will work over concatinating via the movieid. But im not sure how to do this exactly, so that i still have the movie names attached to the movieid.

Anybody has an idea?


Solution

  • The problem consists of essentially 2 parts:

    1. How to transpose df2, the sole table where user ratings comes from, to the desired format. pd.DataFrame.pivot_table is the standard way to go.
    2. The rest is about mapping the movieIDs to their names. This can be easily done by direct substitution on df.columns.

    In addition, if movies receiving no ratings were to be listed as well, just insert the missing movieIDs directly before name substitution mentioned previously.

    Code

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame(
        data={
            "movieId": [1,2,3,4,5],
            "title": ["toy story (1995)",
                      "Jumanji (1995)",
                      "Grumpier 33 (1995)",  # shortened for printing
                      "Waiting 44 (1995)",
                      "Father 55 (1995)"],
    
        }
    )
    # to better demonstrate the correctness, 2 distinct user ids were used.
    df2 = pd.DataFrame(
        data={
            "userId": [1,1,1,2,2],
            "movieId": [1,2,2,3,5],
            "rating": [4,5,4,5,4]
        }
    )
    
    # 1. Produce the main table
    df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
    
    print(df_new)  # already pretty close
    Out[17]: 
    movieId    1    2    3    5
    userId                     
    1        4.0  4.5  NaN  NaN
    2        NaN  NaN  5.0  4.0
    
    
    # 2. map movie ID's to titles
    
    # name lookup dataset
    df_names = df1[["movieId", "title"]].set_index("movieId")
    
    # strip the last 7 characters containing year
    # (assume consistent formatting in df1)
    df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
    
    # (optional) fill unrated columns and sort
    for movie_id in df_names.index.values:
        if movie_id not in df_new.columns.values:
            df_new[movie_id] = np.nan
    else:
        df_new = df_new[df_names.index.values]
    
    # replace IDs with titles
    df_new.columns = df_names.loc[df_new.columns, "title"].values
    

    Result

    df_new
    Out[16]: 
            toy story  Jumanji  Grumpier 33  Waiting 44  Father 55
    userId                                                        
    1             4.0      4.5          NaN         NaN        NaN
    2             NaN      NaN          5.0         NaN        4.0