I've read through some of the other questions on here about extracting dates (or different sections) from filenames, but I can't seem to get any of the other answers to work on my filenames. I have a list of >15,000 filenames from a directory and I need to extract the dates from the filenames so I can then figure out which dates I am missing (I should have 15,706 in total, but in some directories I only have ~15,600).
Here's an example:
maxTemps <- list.files("./Daily/Daily_TMax/", recursive = TRUE, pattern = ".asc$", full.names = FALSE)
length(maxTemps)
[1] 15697
head(maxTemps)
[1] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700102.asc"
[3] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700103.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700104.asc"
[5] "1970/eMAST_ANUClimate_day_tmax_v1m0_19700105.asc" "1970/eMAST_ANUClimate_day_tmax_v1m0_19700106.asc"
tail(maxTemps)
[1] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121226.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121227.asc"
[3] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121228.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121229.asc"
[5] "2012/eMAST_ANUClimate_day_tmax_v1m0_20121230.asc" "2012/eMAST_ANUClimate_day_tmax_v1m0_20121231.asc"
I've been able to use the following code to get the years (based on the folder):
regmatches(maxTemp, regexpr("[0-9]{4}", maxTemp))
I thought I could use this with invert = TRUE to return the rest of the strings, because if I try to include the constant part of the filename in the regexpr I get an error:
maxTempsFiles <- regmatches(maxTemp, regexpr("[0-9]{4}\/(eMAST_ANUClimate_day_tmax_v1m0_)", maxTemp), invert = TRUE)
Error: '\/' is an unrecognized escape in character string starting ""[0-9]{4}\/"
So I thought I could use the code that works and then subset out the constant part of the filename, leaving me with the date, and then I would just need to remove the .asc with sub, but this returns some messy text:
maxTempsFiles <- regmatches(maxTemp, regexpr("[0-9]{4}", maxTemp), invert = TRUE)
maxTempsFiles <- sub(x = maxTempsFiles, pattern = "/eMAST_ANUClimate_day_tmax_v1m0_", replacement = "")
maxTempsFiles <- sub(x = maxTempsFiles, pattern = ".asc", replacement = "")
head(maxTempsFiles)
[1] "c(\"\", \"19700101\")" "c(\"\", \"19700102\")" "c(\"\", \"19700103\")" "c(\"\", \"19700104\")" "c(\"\", \"19700105\")"
[6] "c(\"\", \"19700106\")"
The files always have /eMAST_ANUClimate_day_prec_v1m0_ in them; it is just the first folder that changes, and the end of the filename, 19700101.asc through to 20121231.asc.
If someone could provide some code/advice on how best to do this, that'd be great.
This can be done with a regular expression that uses capture groups: match the whole string, then return only the group that holds the date. Here x is the character vector of filenames (maxTemps in the question).
gsub("(^.*_)(\\d+)\\.asc$", "\\2", x)
Regex explanation:
group 1:
(^.*_) - match the beginning of the string (^) and then any characters up to and including the last _ (.* is greedy)
group 2:
(\\d+) - match one or more digits (+)
no group:
\\.asc$ - at last, find .asc, which should be the end of the string ($)
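As a rough sketch of how the pattern above could be used for the stated goal of finding missing dates: the file names below are made-up samples in the same shape as those in the question, and the names files, date_strings, found, expected and missing_dates are illustrative assumptions, not part of the original code.

```r
# Hypothetical sample file names shaped like those in the question
files <- c("1970/eMAST_ANUClimate_day_tmax_v1m0_19700101.asc",
           "1970/eMAST_ANUClimate_day_tmax_v1m0_19700102.asc",
           "1970/eMAST_ANUClimate_day_tmax_v1m0_19700104.asc")

# Keep only group 2 (the digits immediately before ".asc")
date_strings <- gsub("(^.*_)(\\d+)\\.asc$", "\\2", files)
date_strings
# [1] "19700101" "19700102" "19700104"

# Convert to Date and compare against the expected daily sequence
found <- as.Date(date_strings, format = "%Y%m%d")
expected <- seq(as.Date("1970-01-01"), as.Date("1970-01-05"), by = "day")

# %in% keeps the Date class, so the result prints as dates
missing_dates <- expected[!expected %in% found]
missing_dates
# [1] "1970-01-03" "1970-01-05"
```

For the real data, expected would presumably run from 1970-01-01 to 2012-12-31, which is 15,706 days and matches the total mentioned in the question.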
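The replacement argument in gsub specifies what the matched part of the string is replaced with, and it can refer back to a captured group. To keep only group 2, use \\2. Because the pattern matches the whole string, the entire filename is replaced by the date. The difference between sub and gsub is that sub replaces only the first match in each element, while gsub replaces every match; both work on the entire vector. Since this pattern is anchored at both ends and can match at most once per string, sub would give the same result here.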
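A small check of how sub and gsub behave on a vector may help here. Both process every element; sub replaces only the first match within each element, while gsub replaces all of them. The vector x below is a toy example, not data from the question.

```r
# Toy input: two elements, each containing two underscores
x <- c("a_b_c", "d_e_f")

# sub: only the first "_" in each element is replaced
first_only <- sub("_", "-", x)
first_only
# [1] "a-b_c" "d-e_f"

# gsub: every "_" in each element is replaced
all_matches <- gsub("_", "-", x)
all_matches
# [1] "a-b-c" "d-e-f"
```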