
Add text to an atomic (character) vector in R


Good afternoon. I am not an expert on atomic vectors, but I would like some ideas about the following problem.

I have the script for the movie "Coco" and I want to pick out the lines that hold a number in the form 1., 2., ... (there are 130 scenes in the movie). I want to convert each of those numbered lines into one that reads "Scene 1", "Scene 2", and so on up to "Scene 130", in sequence.

library(readr)  # provides read_lines()

url <- "https://www.imsdb.com/scripts/Coco.html"

coco <- read_lines("coco2.txt")  # script text after cleaning
class(coco)
typeof(coco)

"                                                                        48."      
 [782] "     arms full of offerings."                                                     
 [783] "      Once the family clears, Miguel is nowhere to be seen."                      
 [784] "      INT. NEARBY CORRIDOR"                                                       
 [785] "     Miguel and Dante hide from the patrolman.     But Dante wanders"             
 [786] "     off to inspect a side room."                                                 
 [787] "      INT. DEPARTMENT OF CORRECTIONS"                                             
 [788] "     Miguel catches up to Dante.      He overhears an exchange in a"              
 [789] "     nearby cubicle."                                                             

 [797] "                                                          49."                    
 [798] "                 And amigos, they help their amigos."                             
 [799] "                 worth your while."                                               
 [800] "     workstation."                                                                
 [801] "      Miguel perks at the mention of de la Cruz."                                 


 [809] "      Miguel follows him."                                                        
 [810] "                                                                     50." # this is the scene number
 [811] "      INT. HALLWAY"      


s <- grep(coco, pattern = "[^Level].[0-9].$", value = TRUE)

My attempt below is wrong because the result is not sequential: every line gets the same scene number.

v <- gsub(s, pattern = "[^Level].[0-9].$", replacement = paste("Scene", sequence(1:130)))


[1] "                                                                   Scene1"          
  [2] "                                                                   Scene1"          
  [3] "                                                                  Scene1"           
  [4] "                                                                       Scene1"      
  [5] "                                                                    Scene1"         
  [6] "                                                                   Scene1"          

Solution

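  • I'm not sure what [^Level] is meant to do: inside square brackets it is a negated character class that matches any single character other than L, e, v, or l, not the word "Level". Your attempt repeats one value because gsub() uses only the first element of a replacement vector (and warns about it). If the numbers at the ends of those lines are the scene numbers, you can capture them with ( ) and reuse them in the replacement as "\\1", as shown below: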

     v <- gsub(s, pattern = " ([0-9]{1,3})\\.$", replacement = "Scene \\1")
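
To make the idea concrete, here is a minimal sketch on a small made-up sample rather than the real file. The vector coco_sample and the names num_pattern, is_marker, relabelled and sequential are hypothetical and only for illustration. It assumes a scene marker is a line holding nothing but whitespace, one to three digits and a final period. It shows two options: reusing the printed number through the capture group, or numbering the markers in order of appearance if the printed numbers turn out not to be the scene numbers.

```r
# Hypothetical sample standing in for a few lines of the cleaned script
coco_sample <- c(
  "                                        48.",
  "     arms full of offerings.",
  "      INT. NEARBY CORRIDOR",
  "                                  49.",
  "      Miguel follows him.",
  "                               50."
)

# Assumption: a marker line is only whitespace, 1-3 digits and a period
num_pattern <- "^[[:space:]]*([0-9]{1,3})\\.$"
is_marker <- grepl(num_pattern, coco_sample)

# Option 1: keep the number already on the line via the capture group \\1
relabelled <- coco_sample
relabelled[is_marker] <- sub(num_pattern, "Scene \\1", coco_sample[is_marker])

# Option 2: ignore the printed number and count markers in order of appearance
sequential <- coco_sample
sequential[is_marker] <- paste("Scene", seq_len(sum(is_marker)))

print(relabelled[is_marker])
print(sequential[is_marker])
```

Option 2 works because the assignment is vectorised over the matched positions, so each marker line receives its own label, whereas a replacement vector passed to gsub() is cut down to its first element.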