I am having trouble extracting a pattern from strings in R.
I have a list of file names that include the full path, and I need to extract a small part of each name to assign to another vector. I can't work out the right pattern definition to extract the desired part.
Here are some examples of the file names. I have a list of 500+, so the sub call needs to be generic.
files = c("G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2009\\odf\\CTD_2009099_BSM2AMONT_1_DN.ODF",
"G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2011\\odf\\CTD_2011PSSL_42_1_DN.ODF",
"G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2011\\odf\\CTD_2011PSSL_10_1_DN.ODF",
"G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2022\\odf\\CTD_2022PSSL_202201017ESTMARAVAL_1_DN.odf")
The patterns I am looking to extract are these:
pattern = c("BSM2AMONT_1_DN",
"PSSL_42_1_DN",
"PSSL_10_1_DN",
"ESTMARAVAL_1_DN")
So far, I have this function:
pattern <- sub(".*\\d{4,7}([A-Z]|\\d{9})(.*)\\.(ODF|odf)", "\\2", files)
This works almost perfectly, except for the last example, where I get "STMARAVAL_1_DN" instead of the desired "ESTMARAVAL_1_DN".
Thank you for your help!
sub(".*\\d{4,7}_?(?:([A-Z])|\\d{9})(.*)\\.(ODF|odf)", "\\1\\2", files)
# [1] "BSM2AMONT_1_DN" "PSSL_42_1_DN" "PSSL_10_1_DN" "ESTMARAVAL_1_DN"
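I changed ([A-Z]|\\d{9}) to (?:([A-Z])|\\d{9}). The ?: makes the outer group non-capturing, so it no longer takes up a backreference; the letter is then captured by its own inner group so that it can be added back through the replacement string "\\1\\2". I also added _? before the group so that the first element matches, since there the digits are followed by an underscore rather than a letter.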
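
If it helps to see where the letter goes, here is a minimal sketch that prints what each capture group picks up for the last example. It assumes base R's regexec() and regmatches(); the names x, old_pat, new_pat, old_groups and new_groups are only illustrative and are not from the question.

```r
# Last example from the question (the one that lost its leading "E").
x <- "G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2022\\odf\\CTD_2022PSSL_202201017ESTMARAVAL_1_DN.odf"

old_pat <- ".*\\d{4,7}([A-Z]|\\d{9})(.*)\\.(ODF|odf)"           # pattern from the question
new_pat <- ".*\\d{4,7}_?(?:([A-Z])|\\d{9})(.*)\\.(ODF|odf)"     # revised pattern

# regexec() + regmatches() return the full match followed by one entry
# per capture group (non-capturing (?:...) groups are not listed).
old_groups <- regmatches(x, regexec(old_pat, x))[[1]]
new_groups <- regmatches(x, regexec(new_pat, x))[[1]]

print(old_groups[-1])  # capture groups 1..3 of the original pattern
print(new_groups[-1])  # capture groups 1..3 of the revised pattern

# The original replacement keeps only group 2, so the letter in group 1 is lost.
sub(old_pat, "\\2", x)
# [1] "STMARAVAL_1_DN"

# The revised replacement puts group 1 (the letter) back in front of group 2.
sub(new_pat, "\\1\\2", x)
# [1] "ESTMARAVAL_1_DN"
```

When the \\d{9} branch is the one that matches, group 1 should come back empty, so "\\1\\2" reduces to group 2 alone; that is the reason for capturing only the letter rather than the whole alternation.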