r  data-cleaning  one-hot-encoding

Movies Dataset - Encoding variable that is a list of top four actors in that movie (R)


This is my dataset: when I select the Actors column, I get a list of vectors (4 actors per movie):

head(movies$Actors)

[[1]]
[1] "Rishab Shetty" " Sapthami Gowda" " Kishore Kumar G."
[4] " Achyuth Kumar"

[[2]]
[1] "Christian Bale" " Heath Ledger" " Aaron Eckhart" " Michael Caine"

[[3]]
[1] "Elijah Wood" " Viggo Mortensen" " Ian McKellen"
[4] " Orlando Bloom"

[[4]]
[1] "Leonardo DiCaprio" " Joseph Gordon-Levitt" " Elliot Page"
[4] " Ken Watanabe"

[[5]]
[1] "Elijah Wood" " Ian McKellen" " Viggo Mortensen"
[4] " Orlando Bloom"

[[6]]
[1] "Elijah Wood" " Ian McKellen" " Orlando Bloom" " Sean Bean"

Since there are 5000 rows, there are far too many distinct actors for one-hot encoding. Instead, I tried to find the top 20 actors by number of movies (using sort() and table()) and then add a binary variable that indicates whether a particular movie features any of those top actors, as this might be a simple proxy for whether the movie has good ratings.

Unfortunately, the code doesn't work. Can't seem to google my way out of this either. Can anyone help me?

## get 20 biggest actors in terms of number of movies 
top20actorstable <- sort(table(actorlist), decreasing = T)[1:20]
names(top20actorstable)
## one hot encoding 
top20actorsnames <- names(top20actorstable)

movies$bigactor <- NA

for (i in nrow(movies)){
  listactors <- unlist(movies[i,]$Actors)
  if (any(is.element(listactors, top20actorsnames))){
    movies[i,]$bigactor <- 1 
  }
  else {movies[i,]$bigactor <- 0}
}

Edit:

> dput(head(movies$Actors, 10))
list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.", 
" Achyuth Kumar"), c("Christian Bale", " Heath Ledger", " Aaron Eckhart", 
" Michael Caine"), c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", 
" Orlando Bloom"), c("Leonardo DiCaprio", " Joseph Gordon-Levitt", 
" Elliot Page", " Ken Watanabe"), c("Elijah Wood", " Ian McKellen", 
" Viggo Mortensen", " Orlando Bloom"), c("Elijah Wood", " Ian McKellen", 
" Orlando Bloom", " Sean Bean"), c("Keanu Reeves", " Laurence Fishburne", 
" Carrie-Anne Moss", " Hugo Weaving"), c("Mark Hamill", " Harrison Ford", 
" Carrie Fisher", " Billy Dee Williams"), c("Arnold Schwarzenegger", 
" Linda Hamilton", " Edward Furlong", " Robert Patrick"), c("Mark Hamill", 
" Harrison Ford", " Carrie Fisher", " Alec Guinness"))

What I meant by "code doesn't work": I was hoping for the for loop to, one by one, check within the list of actors of each row, unlist them and check against the list of top20actors - if there is one of the top actors, then the bigactor column would be a 1, otherwise 0.

However, when I check the column after the for loop, it does not contain the expected 0/1 values:

> for (i in nrow(movies)){
+   listactors <- unlist(movies[i,]$Actors)
+   if (any(is.element(listactors, top20actorsnames))){
+     movies[i,]$bigactor <- 1 
+   }
+   else {movies[i,]$bigactor <- 0}
+ }
Warning: provided 11 variables to replace 10 variables
> movies$bigactor
NULL

Solution

  • Here is my approach. First, build the vector of actors of interest.
    Then loop over the list of movies (using sapply()) and check whether each movie contains (%in%) any of those actors. The result is a logical vector (TRUE/FALSE) with one element per movie.

    movies <- list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.", "Achyuth Kumar"), 
                   c("Christian Bale", " Heath Ledger", " Aaron Eckhart", " Michael Caine"), 
                   c("Elijah Wood", " Viggo Mortensen", " Ian McKellen",  " Orlando Bloom"), 
                   c("Leonardo DiCaprio", " Joseph Gordon-Levitt", " Elliot Page", " Ken Watanabe"), 
                   c("Elijah Wood", " Ian McKellen", " Viggo Mortensen", " Orlando Bloom"), 
                   c("Elijah Wood", " Ian McKellen",  " Orlando Bloom", " Sean Bean"), 
                   c("Keanu Reeves", " Laurence Fishburne",  " Carrie-Anne Moss", " Hugo Weaving"), 
                   c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Billy Dee Williams"), 
                   c("Arnold Schwarzenegger",      " Linda Hamilton", " Edward Furlong", " Robert Patrick"), 
                   c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Alec Guinness"))
    
    
    
    # build a single vector of all actors;
    # trimws() removes the leading and trailing spaces
    actorlist <- unlist(movies) |> trimws()
    # names of the most frequent actors
    # (shortened to 7 for debugging; use 1:20 on the full data)
    top20actorstable <- sort(table(actorlist), decreasing = TRUE)[1:7] |> names()
    
    # loop through the list looking for matching actors;
    # returns a logical vector with one TRUE/FALSE per movie
    bigactor <- sapply(movies, function(movie) {
       any(trimws(movie) %in% top20actorstable)
    })
    bigactor
    # convert TRUE/FALSE to 1/0
    as.integer(bigactor)
    

    Since the data sample you provided is a list, I am not sure where the final results should be stored. You could try to store your list of vectors in a data frame, but that is complicated and not very helpful.
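
If movies is in fact a data frame with Actors as a list column, as the question's output suggests, the flag can probably be assigned straight back as a new column. The sketch below is a minimal illustration under that assumption: the toy data frame, the Title column and the names top_n, top_actors and bigactor_loop are made up for the example and are not taken from the original data. It also includes a loop version, because `for (i in nrow(movies))` iterates over a single value (the row count), so only the last row is ever visited; `seq_len(nrow(movies))` visits every row.

```r
# Toy data frame with a list column, mirroring the shape of movies$Actors.
# Title is a hypothetical column added only to make the example self-contained.
movies <- data.frame(Title = paste("Movie", 1:4))
movies$Actors <- list(
  c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", " Orlando Bloom"),
  c("Elijah Wood", " Ian McKellen", " Orlando Bloom", " Sean Bean"),
  c("Keanu Reeves", " Laurence Fishburne", " Carrie-Anne Moss", " Hugo Weaving"),
  c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Alec Guinness")
)

# Trim the stray leading spaces, then count how often each actor appears.
actorlist <- trimws(unlist(movies$Actors))
top_n <- 3  # the question uses 20; 3 suits this small sample
top_actors <- names(sort(table(actorlist), decreasing = TRUE))[seq_len(top_n)]

# Vectorised version: one integer (1/0) per row, stored as a regular column.
movies$bigactor <- vapply(
  movies$Actors,
  function(cast) as.integer(any(trimws(cast) %in% top_actors)),
  integer(1)
)

# Loop version: seq_len(nrow(movies)) yields 1, 2, ..., nrow(movies),
# whereas nrow(movies) on its own is just one number.
bigactor_loop <- integer(nrow(movies))
for (i in seq_len(nrow(movies))) {
  cast <- trimws(movies$Actors[[i]])
  bigactor_loop[i] <- as.integer(any(cast %in% top_actors))
}

movies[, c("Title", "bigactor")]
```

Using vapply() rather than sapply() here is a design choice: it fixes the return type to a single integer per movie, so an unexpected element in the list column fails loudly instead of silently changing the shape of the result.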