Retrieve portion of file name


I have a number of files with the following format:

sub_(number 1 to 60)_sess_(number 1, 2, or 3)_(some letters)_DDMMMYYYY_(some number with either 3 or 4 digits).txt

For example:

sub_41_sess_2_ABCxyz_23Feb2016_2932.txt

I want to retrieve only the session number (1, 2, or 3) that follows 'sess_', and I think the sub() function can return it for every file. I based my attempt on two related pages.

Here's the code I tried, which didn't work:

dir <- "path/"
filelist = list.files(path = dir, pattern = ".*.txt")
filelist

for (f in filelist) {

    sess_id <- sub("^(sub_[1-60])^(_sess_)(1 |2 |3)^.*","\\1",c(f), perl = TRUE)

}
sess_id

What was returned was a single filename that looks like this:

[1] "subject_9_4Feb2016_1611.txt"

I am expecting something like the output below, because I need each sess_id to be stored as an attribute of its file, where every file follows the format stated above.

[1] "1" or [1] "2" 

Solution

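  • We can do this with gsub by matching either of two alternatives, separated by |. The first (^.*sess\\_) matches everything from the start of the string up to and including sess_. The second (\\_[A-Z]+.*$) matches an underscore followed by one or more upper-case letters and everything after them to the end of the string ($). Both matches are replaced with '', which leaves only the session number.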

    gsub('^.*sess\\_|\\_[A-Z]+.*$', '', str1)
    #[1] "2"
    

    Alternatively, str_extract from the stringr package is more compact. By default, str_extract returns only the first match in each string. Here we extract the digits (\\d+) that are preceded by sess_, using a regex lookbehind ((?<=sess_)).

    library(stringr)
    str_extract(str1, '(?<=sess_)\\d+')
    #[1] "2"
    

    data

    str1 <- "sub_41_sess_2_ABCxyz_23Feb2016_2932.txt"
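
The same idea can be applied to a whole vector of file names at once, since sub() is vectorised. The following is a minimal sketch in base R, not part of the original answer: the file names are hypothetical stand-ins for the output of list.files(), and the pattern assumes every name follows the format described in the question.

```r
# Hypothetical file names following the format described in the question;
# in practice this vector would come from list.files().
filelist <- c(
  "sub_41_sess_2_ABCxyz_23Feb2016_2932.txt",
  "sub_7_sess_1_DEFuvw_04Feb2016_161.txt",
  "sub_60_sess_3_GHIrst_15Mar2016_1024.txt"
)

# sub() works on the whole vector, so no loop is needed.
# The capture group ([1-3]) holds the single session digit, and "\\1"
# replaces the entire file name with the contents of that group.
sess_id <- sub("^sub_[0-9]+_sess_([1-3])_.*\\.txt$", "\\1", filelist)

# Keep each session id next to the file it came from.
result <- data.frame(file = filelist, sess_id = sess_id,
                     stringsAsFactors = FALSE)
print(result)
```

Note that sub() returns a string unchanged when the pattern does not match it, so a file name that deviates from the expected format would come back whole rather than as a single digit. Assigning the result once for the whole vector also avoids overwriting sess_id on every pass of a loop.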