This is my dataset: when I select the Actors column, I get a list of character vectors (4 actors per movie).
head(movies$Actors)
[[1]]
[1] "Rishab Shetty" " Sapthami Gowda" " Kishore Kumar G."
[4] " Achyuth Kumar"

[[2]]
[1] "Christian Bale" " Heath Ledger" " Aaron Eckhart" " Michael Caine"

[[3]]
[1] "Elijah Wood" " Viggo Mortensen" " Ian McKellen"
[4] " Orlando Bloom"

[[4]]
[1] "Leonardo DiCaprio" " Joseph Gordon-Levitt" " Elliot Page"
[4] " Ken Watanabe"

[[5]]
[1] "Elijah Wood" " Ian McKellen" " Viggo Mortensen"
[4] " Orlando Bloom"

[[6]]
[1] "Elijah Wood" " Ian McKellen" " Orlando Bloom" " Sean Bean"
Since there are 5000 rows, there are far too many actors to use for one-hot encoding. What I tried to do is find the top 20 actors (using sort() and table()) and then add a binary variable that states whether a particular movie features any of those top 20 actors, as this might be a simple proxy for whether the movie has good ratings.
Unfortunately, the code doesn't work. Can't seem to google my way out of this either. Can anyone help me?
## get 20 biggest actors in terms of number of movies
top20actorstable <- sort(table(actorlist), decreasing = T)[1:20]
names(top20actorstable)
## one hot encoding
top20actorsnames <- names(top20actorstable)
movies$bigactor <- NA
for (i in nrow(movies)){
listactors <- unlist(movies[i,]$Actors)
if (any(is.element(listactors, top20actorsnames))){
movies[i,]$bigactor <- 1
}
else {movies[i,]$bigactor <- 0}
}
Edit:
> dput(head(movies$Actors, 10))
list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.",
" Achyuth Kumar"), c("Christian Bale", " Heath Ledger", " Aaron Eckhart",
" Michael Caine"), c("Elijah Wood", " Viggo Mortensen", " Ian McKellen",
" Orlando Bloom"), c("Leonardo DiCaprio", " Joseph Gordon-Levitt",
" Elliot Page", " Ken Watanabe"), c("Elijah Wood", " Ian McKellen",
" Viggo Mortensen", " Orlando Bloom"), c("Elijah Wood", " Ian McKellen",
" Orlando Bloom", " Sean Bean"), c("Keanu Reeves", " Laurence Fishburne",
" Carrie-Anne Moss", " Hugo Weaving"), c("Mark Hamill", " Harrison Ford",
" Carrie Fisher", " Billy Dee Williams"), c("Arnold Schwarzenegger",
" Linda Hamilton", " Edward Furlong", " Robert Patrick"), c("Mark Hamill",
" Harrison Ford", " Carrie Fisher", " Alec Guinness"))
What I meant by "code doesn't work": I was hoping for the for loop to, one by one, check within the list of actors of each row, unlist them and check against the list of top20actors - if there is one of the top actors, then the bigactor column would be a 1, otherwise 0.
However, when I check the column after the for loop, it has not been filled in. I get a warning and NULL:
> for (i in nrow(movies)){
+ listactors <- unlist(movies[i,]$Actors)
+ if (any(is.element(listactors, top20actorsnames))){
+ movies[i,]$bigactor <- 1
+ }
+ else {movies[i,]$bigactor <- 0}
+ }
Warning: provided 11 variables to replace 10 variables
> movies$bigactor
NULL
Here is my approach. Make the list of actors of interest. Then loop through the list of movies (using sapply()) and find the movies containing (%in%) the actors of interest. Return a vector of TRUE/FALSE values corresponding to the matches.
movies <- list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.", "Achyuth Kumar"),
c("Christian Bale", " Heath Ledger", " Aaron Eckhart", " Michael Caine"),
c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", " Orlando Bloom"),
c("Leonardo DiCaprio", " Joseph Gordon-Levitt", " Elliot Page", " Ken Watanabe"),
c("Elijah Wood", " Ian McKellen", " Viggo Mortensen", " Orlando Bloom"),
c("Elijah Wood", " Ian McKellen", " Orlando Bloom", " Sean Bean"),
c("Keanu Reeves", " Laurence Fishburne", " Carrie-Anne Moss", " Hugo Weaving"),
c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Billy Dee Williams"),
c("Arnold Schwarzenegger", " Linda Hamilton", " Edward Furlong", " Robert Patrick"),
c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Alec Guinness"))
# build one flat vector of actor names
# trimws() removes the leading and trailing spaces
actorlist <- unlist(movies) |> trimws()
# names of the most frequent actors (shortened to the top 7 for debugging)
top20actorsnames <- sort(table(actorlist), decreasing = TRUE)[1:7] |> names()
# loop through the list looking for matching actors
# returns a vector of TRUE/FALSE for the matches
bigactor <- sapply(movies, function(movie) {
  any(trimws(movie) %in% top20actorsnames)
})
bigactor
as.integer(bigactor)
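The trimws() step matters because table() treats "Ian McKellen" and " Ian McKellen" as different strings, so an actor who is listed first in one movie and later in another would be counted under two names. A minimal sketch of that effect, using a small made-up vector (raw_names is hypothetical and not taken from the real data):

```r
# Hypothetical mini example: the same two actors, with and without a leading space
raw_names <- c("Elijah Wood", " Ian McKellen", "Ian McKellen",
               " Elijah Wood", " Ian McKellen")

# Without trimming, table() counts every spelling variant separately
raw_counts <- table(raw_names)
length(raw_counts)

# After trimws() the variants collapse into one entry per actor
clean_counts <- sort(table(trimws(raw_names)), decreasing = TRUE)
clean_counts
```

Trimming both when building the frequency table and when matching each movie keeps the two steps consistent.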
Since the data sample you provided is a plain list, I am not sure where the final results should be stored. You could try to store your list of vectors in a data frame, but that is complicated and not very helpful.
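If the actors do live in a list column of a data frame, as in the question, the same idea can be written back as a new column. The sketch below is only an illustration under assumptions: movies_df, its id column, top_names and bigactor_loop are hypothetical names, and top_names stands in for the real top 20. It also shows a corrected loop: in the question, for (i in nrow(movies)) iterates over the single value nrow(movies), so only the last row is ever visited, whereas seq_len(nrow(movies)) visits every row.

```r
# Small data frame with a list column, mirroring the layout in the question
movies_df <- data.frame(id = 1:4)
movies_df$Actors <- list(
  c("Christian Bale", " Heath Ledger", " Aaron Eckhart", " Michael Caine"),
  c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", " Orlando Bloom"),
  c("Elijah Wood", " Ian McKellen", " Orlando Bloom", " Sean Bean"),
  c("Keanu Reeves", " Laurence Fishburne", " Carrie-Anne Moss", " Hugo Weaving")
)

# Hypothetical stand-in for the names of the top 20 actors
top_names <- c("Elijah Wood", "Ian McKellen")

# Vectorised version: one TRUE/FALSE per movie, stored as 0/1
movies_df$bigactor <- as.integer(vapply(
  movies_df$Actors,
  function(actors) any(trimws(actors) %in% top_names),
  logical(1)
))

# Loop version: seq_len() makes the loop run once per row
bigactor_loop <- integer(nrow(movies_df))
for (i in seq_len(nrow(movies_df))) {
  actors <- trimws(movies_df$Actors[[i]])
  bigactor_loop[i] <- as.integer(any(actors %in% top_names))
}

movies_df$bigactor
```

Filling a separate vector (or assigning with movies_df$bigactor[i]) tends to be less fragile than assigning through movies[i,]$bigactor, which replaces a whole row on each iteration.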