Tags: r, machine-learning, recommendation-engine

How can I model the effect of genre on movie ratings?


I'm doing a machine learning exercise in R using a larger version of the MovieLens dataset (10 million rows), where my task is to predict the ratings in the validation set from the data in the training set. My current model is as follows:

Rating by user u for movie i = mu + b_i + b_u + epsilon, where mu is the mean rating, b_i is the effect of each movie, b_u is the effect of each user. Epsilon is supposed to be the random error term, but right now it also contains the effect of genres which I haven't accounted for.
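To make the model above concrete, here is a minimal base R sketch that estimates mu, b_i and b_u by sequential averaging and then computes the residual. The question does not say how the effects were estimated, so the averaging approach is an assumption, as are the toy data and the column names userId, movieId and rating.

```r
# Toy ratings table; data and column names are hypothetical.
ratings <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(4, 3, 5, 2, 4)
)

# mu: overall mean rating
mu <- mean(ratings$rating)

# b_i: per-movie mean of (rating - mu); ave() defaults to FUN = mean
ratings$b_i <- ave(ratings$rating - mu, ratings$movieId)

# b_u: per-user mean of what is left after removing mu and b_i
ratings$b_u <- ave(ratings$rating - mu - ratings$b_i, ratings$userId)

# resid: the part of the rating the model does not explain yet
ratings$resid <- ratings$rating - mu - ratings$b_i - ratings$b_u
ratings
```

Any genre effect that exists would still be sitting inside resid at this point.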

Here's a screenshot of my current dataset for reference - note that the resid column contains the residual rating after subtracting mu, b_i, and b_u.

[Image: screenshot of the dataset]

I'm stuck because I have no idea how to model the effect of genres. Does anyone have any tips on how I can proceed?


Solution

  • Main Idea: Convert each value in the "Genre" field into its own field (for example Comedy, Romance) holding a Y/N or 0/1 flag, i.e. one-hot encode the genres.

    I'll illustrate with the sample data below. It should give you the idea, and you can then apply the same steps to your data.

    library(tibble)  # tribble()
    library(tidyr)   # separate(), gather()
    library(dplyr)   # %>%, select()

    sample <- tribble(~ Values,
                      "apple|banana",
                      "orange|apple",
                      "banana|guava")
    sample
    

    Steps to do:

    1. Separate the values in the field into their own columns, using the separate function of tidyr

      sample %>% separate(Values, into = c("val1","val2"), sep = "\\|") -> sample2
      sample2
      
    2. Gather all individual values into a single column, using the gather function of tidyr (superseded by pivot_longer in current tidyr, but it still works)

      sample2 %>% gather(key = "col_name", value = "col_val", val1, val2) -> sample3
      sample3
      
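    3. Finally, one-hot encode the "col_val" field to get the desired output. The result has one row per (record, value) pair, so carry a row identifier through the steps above if you need to collapse it back to one row per original record.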

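      # Keep only the column of values, then one-hot encode it.
      sample4 <- sample3 %>% select(col_val)
      sample4
      # "- 1" drops the intercept so every value gets its own 0/1 column.
      as.data.frame(model.matrix( ~ . - 1, sample4))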
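
Applied to the question's data, the same idea can be sketched in base R so that each record keeps its own row and genre strings of any length are handled. This is only a sketch: the toy data, the column names genres and resid, and the choice to estimate each genre's effect as the mean residual of the records carrying that genre are all assumptions, not something stated in the question or the steps above.

```r
# Hypothetical toy data; genres and resid are assumed column names.
movies <- data.frame(
  genres = c("Comedy|Romance", "Action", "Comedy|Action|Drama", "Romance"),
  resid  = c(0.4, -0.2, 0.1, 0.3),
  stringsAsFactors = FALSE
)

# Split on the literal "|" (fixed = TRUE avoids regex escaping).
genre_list <- strsplit(movies$genres, "|", fixed = TRUE)
all_genres <- sort(unique(unlist(genre_list)))

# One row per record, one 0/1 column per genre.
one_hot <- t(vapply(
  genre_list,
  function(g) as.integer(all_genres %in% g),
  integer(length(all_genres))
))
colnames(one_hot) <- all_genres
one_hot

# Assumed estimator: mean residual over the records tagged with each genre.
# one_hot * movies$resid scales row i by resid[i].
genre_effect <- colSums(one_hot * movies$resid) / colSums(one_hot)
genre_effect
```

Because a movie can belong to several genres, these per-genre means overlap; regressing resid on the 0/1 columns (for example with lm) is another option if the effects should be estimated jointly.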
      

    Let me know if this helps.

    Happy Learning!!!