
How to extract text from a column using R


I have an Excel file with about 56,000 records. For each row in a specific column, how would I extract only part of a string? I need to keep all text to the left of the last forward slash ('/'). The challenge is that not all cells contain the same number of slashes. There is always a filename (*.wav) after the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).

Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav

The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR

Can anyone recommend a strategy using R?


Solution

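  • You can use the str_remove(string, pattern) function from the stringr package. The pattern below matches a slash followed by one or more digits, optional hyphens, optional uppercase letters, optional '#' characters, a literal dot, and a lowercase extension:

    library(stringr)

    # Named `path` rather than `str` to avoid shadowing the base R function str()
    path <- "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
    str_remove(path, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
    

    Output:

    > str_remove(path, "/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
    [1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
    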

    str_remove is vectorized, so you can pass all of the strings at once instead of iterating over them:

    strings <- c("cloch/51.wav",
                 "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
                 "grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
                 "AB_AeolinaL/025-C#.wav",
                 "AB_AeolinaL/026-D.wav",
                 "AB_violadamourL/rel99999/091-G.wav",
                 "AB_violadamourL/rel99999/092-G#.wav",
                 "AB_violadamourR/024-C.wav",
                 "AB_violadamourR/025-C#.wav")
    str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
    

    Output:

    > str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
    [1] "cloch"                                       
    [2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
    [3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
    [4] "AB_AeolinaL"                                 
    [5] "AB_AeolinaL"                                 
    [6] "AB_violadamourL/rel99999"                    
    [7] "AB_violadamourL/rel99999"                    
    [8] "AB_violadamourR"                             
    [9] "AB_violadamourR"
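
The pattern above is tailored to filenames shaped like the examples (digits, an optional hyphen, a note name, an extension). As a possible alternative that does not depend on the filename's shape, the sketch below removes the last '/' and everything after it using base R's sub(). It is only a sketch: the data frame `records` and its column names `file` and `folder` are hypothetical stand-ins for the imported Excel sheet, not something given in the question.

```r
# Three of the example paths from the question.
paths <- c("cloch/51.wav",
           "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
           "AB_violadamourL/rel99999/092-G#.wav")

# "/[^/]*$" matches the last slash plus every non-slash character up to
# the end of the string, so the final path component is dropped whatever
# its length or characters.
folders <- sub("/[^/]*$", "", paths)
print(folders)

# Hypothetical data frame standing in for the imported Excel sheet;
# the column names `file` and `folder` are assumptions.
records <- data.frame(file = paths, stringsAsFactors = FALSE)
records$folder <- sub("/[^/]*$", "", records$file)
print(records$folder)
```

For these inputs, base R's dirname(paths) returns the same values. The two differ when a string contains no slash at all: sub() leaves it unchanged, whereas dirname() returns ".".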