R - Select files by dates in filenames


I already had a similar question here: R - How to choose files by dates in file names?

But I need to make a small change.

I still have a list of filenames, similar to that:

list = c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
         "AT0ILL10000700500dymax.1-1-1990.31-12-2011", 
         "AT0PIL10000700500dymax.1-1-1992.31-12-2011",
         "AT0SON10000700100dymax.1-1-1990.31-12-2011",
         "AT0STO10000700100dymax.1-1-1992.31-12-2006",  
         "AT0VOR10000700500dymax.1-1-1981.31-12-2011",
         "AT110020000700100dymax.1-1-1993.31-12-2001",
         "AT2HE190000700100dymax.1-1-1973.31-12-1994", 
         "AT2KA110000700500dymax.1-1-1991.31-12-2010", 
         "AT2KA410000700500dymax.1-1-1991.31-12-2011")

I already have a command to pick out files that have a certain length of recording (for example, 10 years in this case):

#Listing Files (creates the list above)
files = list.files(pattern="*00007.*dymax", recursive = TRUE)

#Making date readable
split_daymax = strsplit(files, split=".", fixed=TRUE)

from = unlist(lapply(split_daymax, "[[", 2))
to = unlist(lapply(split_daymax, "[[", 3))
from = as.POSIXct(from, format="%d-%m-%Y")
to = as.POSIXct(to, format="%d-%m-%Y")

#units must be named, otherwise "days" is taken as the tz argument
timelistmax = difftime(to, from, units = "days")

#Files with more than 10 years of recording
index = timelistmax >= 10*360
files = files[index]

My problem now is that I have far too many files for any computer to handle.

Now I only want to read in files that contain data from 1993 (or any other year I choose) onwards and have 10 years of recording from then on, so the recordings should run at least until 2003.

So the file covering 1973-1994 should not be included, but the file covering 1981-2011 is fine.

I don't know how to select a year in this case.

I am thankful for any help.


Solution

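    library(stringr)
    library(lubridate)

    # Extract the start and end dates (d-m-yyyy) from each file name
    fileDates <- str_extract_all(files, "[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}")
    
    find_file <- function(x, whichYear, noYears = 10) {
      start <- as.Date(x[[1]], "%d-%m-%Y")
      end <- as.Date(x[[2]], "%d-%m-%Y")
      # Years of recording between whichYear and the end of the file
      years <- as.numeric(end - whichYear, units = "days") / 365
      # Long enough, and whichYear lies within the file's date range
      years > noYears & (year(start) <= year(whichYear) & 
                           year(end) >= year(whichYear))
    }
    # Logical vector: TRUE for each file that meets both conditions
    keep <- sapply(fileDates, find_file, whichYear = as.Date("1993-01-01"), noYears = 10)
    files[keep]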
    

    There are two conditions to check: first calculate the number of years from 1993 to the end of the recording, then use boolean logic to check whether 1993 lies within the file's date range.
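For comparison, the same two conditions can be sketched in base R without extra packages. This is only an illustration: the helper name `covers_period` and its arguments are hypothetical, and it assumes the requirement is read literally as "recording starts on or before 1 January of the chosen year and runs at least until 31 December ten years later" (1993 to 2003 in the question). That reading is slightly stricter than the check in the answer above.

```r
# File names taken from the question
files <- c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
           "AT0ILL10000700500dymax.1-1-1990.31-12-2011",
           "AT0PIL10000700500dymax.1-1-1992.31-12-2011",
           "AT0SON10000700100dymax.1-1-1990.31-12-2011",
           "AT0STO10000700100dymax.1-1-1992.31-12-2006",
           "AT0VOR10000700500dymax.1-1-1981.31-12-2011",
           "AT110020000700100dymax.1-1-1993.31-12-2001",
           "AT2HE190000700100dymax.1-1-1973.31-12-1994",
           "AT2KA110000700500dymax.1-1-1991.31-12-2010",
           "AT2KA410000700500dymax.1-1-1991.31-12-2011")

# Hypothetical helper (not part of the original answer):
# TRUE for files whose recording spans from_year through from_year + n_years
covers_period <- function(files, from_year, n_years = 10) {
  # Name layout assumed from the question: <station>.<start date>.<end date>
  parts <- strsplit(files, ".", fixed = TRUE)
  start <- as.Date(vapply(parts, `[`, character(1), 2), format = "%d-%m-%Y")
  end   <- as.Date(vapply(parts, `[`, character(1), 3), format = "%d-%m-%Y")

  # Assumed reading of the requirement: 1 Jan from_year to 31 Dec (from_year + n_years)
  need_from <- as.Date(sprintf("%d-01-01", from_year))
  need_to   <- as.Date(sprintf("%d-12-31", from_year + n_years))

  start <= need_from & end >= need_to
}

# Keep only the file names that satisfy both conditions
selected <- files[covers_period(files, from_year = 1993, n_years = 10)]
print(selected)
```

With the sample names from the question, this drops the 1973-1994 file and the 1993-2001 file and keeps the rest, so only the remaining names need to be passed on to whatever function reads the data.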