r, regex, extract, filenames

Extracting dates from long filename


I've read through some of the other questions on here about extracting dates (or other sections) from filenames, but I can't get any of the answers to work on my filenames. I have a list of more than 15,000 filenames from a directory, and I need to extract the dates from them so I can work out which dates are missing (I should have 15,706 files in total, but some directories contain only about 15,600).

Here's an example

maxTemps <- list.files("./Daily/Daily_TMax/", recursive = TRUE, pattern = ".asc$", full.names = FALSE)
length(maxTemps)
[1] 15697

head(maxTemps)
[1] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700102.asc"
[3] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700103.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700104.asc"
[5] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700105.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700106.asc"

tail(maxTemps)
[1] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121226.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121227.asc"
[3] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121228.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121229.asc"
[5] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121230.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121231.asc"

I've been able to use the following code to get the years (based on the folder)

regmatches(maxTemps, regexpr("[0-9]{4}", maxTemps))

I thought I could use this with invert = TRUE to return the rest of each string, because when I try to include the constant part of the filename in the regexpr call I get an error:

maxTempsFiles <- regmatches(maxTemps, regexpr("[0-9]{4}\/(eMAST_ANUClimate_day_tmax_v1m0_)", maxTemps), invert = TRUE)
Error: '\/' is an unrecognized escape in character string starting ""[0-9]{4}\/"

So I thought I could use the code that works, strip out the constant part of the filename to leave just the date, and then remove the .asc with sub. But this returns some messy text:

maxTempsFiles <- regmatches(maxTemps, regexpr("[0-9]{4}", maxTemps), invert = TRUE)
maxTempsFiles <- sub(x = maxTempsFiles, pattern = "/eMAST_ANUClimate_day_tmax_v1m0_", replacement = "")
maxTempsFiles <- sub(x = maxTempsFiles, pattern = ".asc", replacement = "")
head(maxTempsFiles)
[1] "c(\"\", \"19700101\")" "c(\"\", \"19700102\")" "c(\"\", \"19700103\")" "c(\"\", \"19700104\")" "c(\"\", \"19700105\")"
[6] "c(\"\", \"19700106\")"

The files always have /eMAST_ANUClimate_day_prec_v1m0_ in them; only the leading folder and the date at the end of the filename change (19700101.asc through to 20121231.asc).

If someone could provide some code/advice on how best to do this, that'd be great.


Solution

  • This is a simple search for a partial match using capture groups, where the part you want is returned from one of the groups.

    gsub("(^.*_)(\\d+)\\.asc$", "\\2", x)
    

    Regex explanation:

    group 1:
      (^.*_) - match from the beginning of the string (^) any characters up to and including the last _ before the digits (.* is greedy)
    group 2:
      (\\d+) - match one or more (+) digits
    no group:
      \\.asc$ - finally, match a literal .asc, which should be the end of the string ($)
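
To make the explanation above concrete, here is a minimal sketch that applies the pattern to a few of the filenames shown in the question. The vector name x follows the answer's code; date_strings is just a name picked for illustration.

```r
# Sample paths copied from the question's head()/tail() output
x <- c("1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc",
       "1970/eMAST_ANUClimate_day_tmax_v1m0_19700102.asc",
       "2012/eMAST_ANUClimate_day_tmax_v1m0_20121231.asc")

# The pattern matches the whole string, so replacing it with "\\2"
# leaves only capture group 2 (the digits before ".asc").
# Backslashes are doubled inside an R string literal; "/" needs no escaping.
date_strings <- gsub("(^.*_)(\\d+)\\.asc$", "\\2", x)

date_strings
# [1] "19700101" "19700102" "20121231"
```

Because .* is greedy, group 1 also swallows the year folder and the slash, so nothing special has to be done about the directory part of the path.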
    

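    The replacement argument in gsub gives the text that replaces the matched part of the string, and it can refer back to a captured group, so \\2 returns group 2. Because the pattern matches the whole string, the whole string is replaced by the date. The difference between sub and gsub is that sub replaces only the first match in each string, while gsub replaces every match; both work on every element of a vector. For example, sub("a", "b", "aaa") gives "baa", whereas gsub("a", "b", "aaa") gives "bbb". Since this pattern is anchored with ^ and $, it can match at most once per string, so sub gives the same result here.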
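
Once the date strings are extracted, finding the missing days (the actual goal in the question) could look roughly like the sketch below. It is only an illustration: the three-file maxTemps vector is made up so that one day is deliberately absent, and found, expected and missing_dates are names chosen here, not anything from the original post.

```r
# Hypothetical sample: 3 Jan 1970 is deliberately left out
maxTemps <- c("1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc",
              "1970/eMAST_ANUClimate_day_tmax_v1m0_19700102.asc",
              "1970/eMAST_ANUClimate_day_tmax_v1m0_19700104.asc")

# Extract the YYYYMMDD part and parse it as a Date
found <- as.Date(sub("(^.*_)(\\d+)\\.asc$", "\\2", maxTemps), format = "%Y%m%d")

# Every day in the expected range (shortened here to four days)
expected <- seq(as.Date("1970-01-01"), as.Date("1970-01-04"), by = "day")

# Dates that are expected but have no matching file
missing_dates <- expected[!expected %in% found]

format(missing_dates, "%Y%m%d")
# [1] "19700103"
```

For the real data the expected range would presumably run from as.Date("1970-01-01") to as.Date("2012-12-31"), which is 15,706 days and so agrees with the file count given in the question.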