I am manipulating a very large movie data set. The data is as below (example)
Ex:
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
If you observe in the genre column, certain genres are comma separated and few are space separated. And the word SciFi has two different appearances: SciFi and Sci-Fi. This is my situation in the entire data set that has around 5000 movies.
I am stuck with an appropriate approach for the following results:
genre1 = Adventure
genre2= Drama
genre3=Sci-Fi
I've used the following command:
movie_genres<-separate(movies, genre, into=c(genre1, genre2, genre3)
The above command is separating the word Sci-Fi as two genres (Sci and Fi or only Sci).
OR
I usually start by "cleaning" the data. In this case, I'd make the formatting of your genre column consistent (genres column separated, no trailing spaces, ...) and then use separate.
library(stringr)
library(tidyr)
title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
movies<-data.frame(title, genre)
movies$genre <- str_replace_all(movies$genre, ",\\s+", ",")
movies$genre <- str_replace_all(movies$genre, "\\s+$", "")
movies$genre <- str_replace_all(movies$genre, "\\s+", ",")
movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
movies$genre
#> [1] "Adventure,Drama,SciFi" "Adventure,Comedy,SciFi" "Adventure,SciFi"
#> [4] "Adventure,Drama,SciFi"
separate(movies, genre, into = c("genre1", "genre2", "genre3"))
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
#> title genre1 genre2 genre3
#> 1 Interstellar Adventure Drama SciFi
#> 2 Back to the Future Adventure Comedy SciFi
#> 3 2001: A Space Odyssey Adventure SciFi <NA>
#> 4 The Martian Adventure Drama SciFi
Created on 2023-01-31 by the reprex package (v2.0.1)