Search code examples
rdata-manipulation

Separate words in a column that are separated by space and comma into different columns in R


I am manipulating a very large movie data set. The data is as below (example)

Ex:

title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")

movies<-data.frame(title, genre)

If you observe in the genre column, certain genres are comma separated and few are space separated. And the word SciFi has two different appearances: SciFi and Sci-Fi. This is my situation in the entire data set that has around 5000 movies.

I am stuck with an appropriate approach for the following results:

  1. How to separate the genre of each movie into individual genres. Ex: I want to separate genre for Interstellar as:

genre1 = Adventure


genre2= Drama


genre3=Sci-Fi


I've used the following command:

movie_genres<-separate(movies, genre, into=c(genre1, genre2, genre3)

The above command is separating the word Sci-Fi as two genres (Sci and Fi or only Sci).

  1. How to delete the hyphen (-) in the word Sci-Fi across the entire genre so that separate function works well.

OR

  1. Is there a work around that adds comma between genres (in genre column) and separate them by comma alone?

Solution

  • I usually start by "cleaning" the data. In this case, I'd make the formatting of your genre column consistent (genres column separated, no trailing spaces, ...) and then use separate.

    library(stringr)
    library(tidyr)
    title<-c("Interstellar", "Back to the Future", "2001: A Space Odyssey", "The Martian")
    genre<-c("Adventure, Drama, SciFi ", "Adventure Comedy SciFi", "Adventure, Sci-Fi", "Adventure Drama Sci-Fi")
    
    movies<-data.frame(title, genre)
    movies$genre <- str_replace_all(movies$genre, ",\\s+", ",") 
    movies$genre <- str_replace_all(movies$genre, "\\s+$", "") 
    movies$genre <- str_replace_all(movies$genre, "\\s+", ",") 
    movies$genre <- str_replace_all(movies$genre, "Sci-Fi", "SciFi")
    movies$genre
    #> [1] "Adventure,Drama,SciFi"  "Adventure,Comedy,SciFi" "Adventure,SciFi"       
    #> [4] "Adventure,Drama,SciFi"
    separate(movies, genre, into = c("genre1", "genre2", "genre3"))
    #> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [3].
    #>                   title    genre1 genre2 genre3
    #> 1          Interstellar Adventure  Drama  SciFi
    #> 2    Back to the Future Adventure Comedy  SciFi
    #> 3 2001: A Space Odyssey Adventure  SciFi   <NA>
    #> 4           The Martian Adventure  Drama  SciFi
    

    Created on 2023-01-31 by the reprex package (v2.0.1)