I am working with a very large movie data set. An example of the data is below:
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
If you look at the genre column, some genres are comma separated and others are space separated. The word SciFi also appears in two forms: SciFi and Sci-Fi. This is the case throughout the entire data set, which has around 5000 movies.
I am stuck finding an appropriate approach to get the following result:
genre1 = Adventure
genre2 = Drama
I've used the following command:
movie_genres <- separate(movies, genre, into = c("genre1", "genre2", "genre3"))
The above command splits the word Sci-Fi into two genres (Sci and Fi, or only Sci).
I usually start by "cleaning" the data. In this case, I'd make the formatting of your genre column consistent (genres comma separated, no trailing spaces, one spelling of SciFi, ...) and then use separate.
library(stringr)
library(tidyr)
title <- c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre <- c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies <- data.frame(title, genre)
# Remove the whitespace that follows a comma
movies$genre <- str_replace_all(movies$genre, ",\\s+", ",")
# Remove trailing whitespace
movies$genre <- str_replace_all(movies$genre, "\\s+$", "")
# Turn the remaining whitespace separators into commas
movies$genre <- str_replace_all(movies$genre, "\\s+", ",")
# Use a single spelling for SciFi
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi" "Adventure,Comedy,SciFi" "Adventure,SciFi"
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#>                   title    genre1 genre2 genre3
#> 1          Interstellar Adventure  Drama  SciFi
#> 2    Back to the Future Adventure Comedy  SciFi
#> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
#> 4           The Martian Adventure  Drama  SciFi
Created on 2023-01-31 by the reprex package (v2.0.1)
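As a side note, the reason Sci-Fi was split in the original attempt is that the default sep of separate() is the regular expression "[^[:alnum:]]+", so the hyphen counts as a delimiter just like a comma or a space. A more compact variant of the same idea is sketched below: unify the spelling first, then pass an explicit sep that matches any run of commas and/or whitespace. This is only a sketch. It assumes stringr and tidyr are installed, and it assumes no genre legitimately contains a space (something like "Film Noir" would be split in two). The name movie_genres is taken from the question.

```r
library(stringr)
library(tidyr)

movies <- data.frame(
  title = c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian"),
  genre = c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
)

# Unify the two spellings and drop leading/trailing whitespace.
movies$genre <- str_trim(str_replace_all(movies$genre, "Sci-Fi", "SciFi"))

# Split on any run of commas and/or whitespace.
# fill = "right" pads rows that have fewer than three genres with NA
# on the right, without emitting a warning.
movie_genres <- separate(
  movies, genre,
  into = c("genre1", "genre2", "genre3"),
  sep = "[,[:space:]]+",
  fill = "right"
)
movie_genres
```

If some movies have more than three genres, the default extra = "warn" will drop the surplus pieces with a warning, so it may be worth checking the maximum number of genres per row before fixing the length of into.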