Problem subsetting a string from a filename


I have an issue with extracting a pattern from strings in R.

I have a list of file names that include the full path, and I need to extract a small part of each name to assign to another vector. I can't work out the right pattern to extract the desired part.

Here are some example file names. The full list has 500+ entries, so the sub call needs to be generic.

files = c("G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2009\\odf\\CTD_2009099_BSM2AMONT_1_DN.ODF",
"G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2011\\odf\\CTD_2011PSSL_42_1_DN.ODF",
"G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2011\\odf\\CTD_2011PSSL_10_1_DN.ODF",
"G:\\Conservation\\Monitoring\\Conditions physico-chimiques\\CTD\\CODES_R\\Data\\CTD_parc_marin\\2022\\odf\\CTD_2022PSSL_202201017ESTMARAVAL_1_DN.odf")

The patterns I am looking to extract are these:

pattern = c("BSM2AMONT_1_DN",
            "PSSL_42_1_DN",
            "PSSL_10_1_DN",
            "ESTMARAVAL_1_DN")

So far, I have this call:

pattern <- sub(".*\\d{4,7}([A-Z]|\\d{9})(.*)\\.(ODF|odf)", "\\2", files)

This works almost perfectly, except for the last example where I get "STMARAVAL_1_DN" instead of the desired "ESTMARAVAL_1_DN".

Thank you for your help!


Solution

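sub(".*\\d{4,7}_?(?:([A-Z])|\\d{9})(.*)\\.(ODF|odf)", "\\1\\2", files)
# [1] "BSM2AMONT_1_DN"  "PSSL_42_1_DN"    "PSSL_10_1_DN"    "ESTMARAVAL_1_DN"

I changed ([A-Z]|\\d{9}) to (?:([A-Z])|\\d{9}). The ?: makes the outer group non-capturing, so the alternation still works but does not use up a backreference number. The inner ([A-Z]) captures the letter as group 1, and the replacement string "\\1\\2" puts that letter back in front of the rest of the name. In your version the letter was matched by the first group and then dropped, because only "\\2" was kept. I also added _? before the group so that the first element, where an underscore follows the digits, matches as well.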
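To see what each capture group holds, here is a minimal sketch that runs the same pattern through regexec() and regmatches() instead of sub(). The paths are shortened versions of the ones in the question, and perl = TRUE, rx, first_letter and rest are my own additions for illustration, not part of the answer above.

```r
# Shortened versions of the paths in the question (only the tail is kept).
files <- c(
  "2009\\odf\\CTD_2009099_BSM2AMONT_1_DN.ODF",
  "2011\\odf\\CTD_2011PSSL_42_1_DN.ODF",
  "2011\\odf\\CTD_2011PSSL_10_1_DN.ODF",
  "2022\\odf\\CTD_2022PSSL_202201017ESTMARAVAL_1_DN.odf"
)

# Same pattern as in the answer.
rx <- ".*\\d{4,7}_?(?:([A-Z])|\\d{9})(.*)\\.(ODF|odf)"

# For each string, regmatches() returns the full match followed by the
# capture groups. perl = TRUE is an assumption made here so the groups
# follow PCRE rules; the answer itself uses the default engine.
groups <- regmatches(files, regexec(rx, files, perl = TRUE))

# Element 1 is the full match, so group 1 is element 2 and group 2 is element 3.
first_letter <- vapply(groups, `[`, character(1), 2)  # the single letter
rest         <- vapply(groups, `[`, character(1), 3)  # remainder of the name

# Gluing the two groups together is what "\\1\\2" does inside sub().
result <- paste0(first_letter, rest)
print(result)
```

Printing first_letter on its own shows the one character that "\\2" alone was dropping. If any of the 500+ names does not match the pattern at all, regmatches() returns an empty vector for it and the vapply() calls will stop with an error, which is a quick way to spot names that need a different rule.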