I'm doing a machine learning exercise in R using a larger version of the MovieLens dataset (10 million rows), where my task is to predict ratings in the validation set using the data in the training set. Currently my model is as follows:
Rating by user u for movie i = mu + b_i + b_u + epsilon, where mu is the mean rating, b_i is the effect of each movie, b_u is the effect of each user. Epsilon is supposed to be the random error term, but right now it also contains the effect of genres which I haven't accounted for.
Here's a screenshot of my current dataset for reference - note that the resid column contains the residual rating after subtracting mu, b_i, and b_u.
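A minimal sketch of how a residual column like this can be computed, on toy data. The data frame `train` and the column names `userId`, `movieId` and `rating` are assumptions for illustration; the real data set may be laid out differently.

```r
library(dplyr)

# Toy stand-in for the training set (hypothetical values and column names)
train <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(4, 3, 5, 2, 4)
)

mu <- mean(train$rating)  # overall mean rating

# Movie effect b_i: average deviation from mu for each movie
movie_avgs <- train %>%
  group_by(movieId) %>%
  summarise(b_i = mean(rating - mu))

# User effect b_u: average of what is left after removing mu and b_i
user_avgs <- train %>%
  left_join(movie_avgs, by = "movieId") %>%
  group_by(userId) %>%
  summarise(b_u = mean(rating - mu - b_i))

# Residual: the part of each rating not explained by mu, b_i and b_u
train_resid <- train %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  mutate(resid = rating - mu - b_i - b_u)

train_resid
```

Whatever genre term is added to the model would be estimated from `resid`, since that is where the unmodelled genre effect currently ends up.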
I'm stuck because I have no idea how to model the effect of genres. Does anyone have any tips on how I can proceed?
Main idea: convert each value in the "Genre" field into its own indicator field (e.g. Comedy, Romance) holding Y/N or 0/1.
The sample data below illustrates the approach; you can then apply the same steps to your own data.
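library(tibble)  # tribble()
library(tidyr)   # separate(), gather()
library(dplyr)   # select() and the %>% pipe
sample <- tribble(~ Values,
"apple|banana",
"orange|apple",
"banana|guava")
sample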
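Steps:
Separate the values in the field using the separate() function from tidyr. Note that `into` names exactly two columns here, which matches this sample; rows with more values would need more names.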
sample %>% separate(Values, into = c("val1","val2"), sep = "\\|") -> sample2
sample2
Gather all individual values into a single column using the gather() function from tidyr.
sample2 %>% gather(key = "col_name", value = "col_val", val1, val2) -> sample3
sample3
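The separate and gather steps above assume every row holds exactly two values, which real genre strings usually do not. As a hedged alternative, tidyr's `separate_rows()` splits a delimited column directly into long form for any number of values per row. The modified sample data and the helper column `row_id` are assumptions added for illustration.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Sample with a varying number of values per row (illustrative data)
sample <- tribble(~ Values,
                  "apple|banana",
                  "orange|apple|guava",
                  "banana")

# Tag each original row, then split on "|" into one row per value
sample_long <- sample %>%
  mutate(row_id = row_number()) %>%
  separate_rows(Values, sep = "\\|")

sample_long
```

Keeping `row_id` means each value can still be traced back to the row it came from, which the later encoding step needs.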
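Finally, use the "col_val" field to get the desired output, i.e. a one-hot encoding. Note that the result has one row per gathered value (six rows for this sample), not one row per original record.
sample4 <- sample3 %>% select(col_val)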
sample4
as.data.frame(model.matrix( ~ . -1, sample4))
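The `model.matrix()` call above returns one row per gathered value, so the link back to the original records is lost. If one row per original record is wanted, with a 0/1 column for every value, a sketch along these lines may help. It swaps in `separate_rows()` and `pivot_wider()` from tidyr in place of the steps above; the names `row_id`, `present` and `encoded` are hypothetical.

```r
library(tibble)
library(tidyr)
library(dplyr)

sample <- tribble(~ Values,
                  "apple|banana",
                  "orange|apple",
                  "banana|guava")

encoded <- sample %>%
  mutate(row_id = row_number()) %>%          # remember the original row
  separate_rows(Values, sep = "\\|") %>%     # one row per (row, value) pair
  mutate(present = 1L) %>%                   # flag that the value occurs
  pivot_wider(id_cols = row_id,
              names_from = Values,
              values_from = present,
              values_fill = list(present = 0L))  # absent values become 0

encoded
```

For the movie data, the same pattern would presumably be applied to the genre column, with a movie or row identifier playing the role of `row_id`.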
Let me know if this helps.
Happy Learning!!!