I have a number of files whose names have the following format:
sub_(number 1 to 60)_sess_(number 1, 2, or 3)_(some letters)_DDMMMYYYY_(some number with either 3 or 4 digits).txt
For example:
sub_41_sess_2_ABCxyz_23Feb2016_2932.txt
I want to retrieve only the '(1, 2, or 3)' portion after 'sess_', and I think the sub()
function can return all those numbers. I referred to these URLs, here and here.
Here's the code I tried, which didn't work:
dir <- "path/"
filelist = list.files(path = dir, pattern = ".*.txt")
filelist
for (f in filelist) {
sess_id <- sub("^(sub_[1-60])^(_sess_)(1 |2 |3)^.*","\\1",c(f), perl = TRUE)
}
sess_id
What was returned was a single filename that looks like this:
[1] "subject_9_4Feb2016_1611.txt"
I am expecting something like below, because I need each sess_id
to be an attribute of the files with overall file format stated above.
[1] "1" or [1] "2"
We can do this using gsub by matching all the characters until the sess followed by _, or (|) the characters that start with _ followed by upper case letters followed by characters (.*) until the end of the string ($), and replace with ''.
gsub('^.*sess\\_|\\_[A-Z]+.*$', '', str1)
#[1] "2"
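Since the question started from sub(), a capture-group version may be closer to the original attempt. The following is only a minimal sketch, assuming every file name follows the format stated in the question; the vector files is made up for illustration. sub() is vectorized, so no for loop is needed (the loop in the question also overwrites sess_id on every iteration, which is why only one value was left at the end), and a string that does not match the pattern is returned unchanged.

```r
# Hypothetical file names following the format described in the question
files <- c("sub_41_sess_2_ABCxyz_23Feb2016_2932.txt",
           "sub_7_sess_1_DEFuvw_04Feb2016_161.txt")

# Capture the single digit (1, 2, or 3) that follows "sess_" and
# replace the whole string with that captured group
sess_id <- sub("^sub_\\d+_sess_([123])_.*$", "\\1", files)
sess_id
#[1] "2" "1"
```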
Or using str_extract, it would be much more compact. By default, str_extract only extracts the first occurrence of the match. Here we extract the numbers (\\d+) that follow the regex lookaround ((?<=sess_)).
library(stringr)
str_extract(str1, '(?<=sess_)\\d+')
#[1] "2"
str1 <- "sub_41_sess_2_ABCxyz_23Feb2016_2932.txt"
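If loading stringr is not an option, the same lookbehind should also work in base R through regexpr() with perl = TRUE together with regmatches(). This is a sketch of an alternative, not part of the approaches above; the variable name m is arbitrary.

```r
str1 <- "sub_41_sess_2_ABCxyz_23Feb2016_2932.txt"

# Locate the first run of digits preceded by "sess_" (PCRE lookbehind)
m <- regexpr("(?<=sess_)\\d+", str1, perl = TRUE)

# Pull out the matched substring
res <- regmatches(str1, m)
res
#[1] "2"
```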