Tags: r, regex, find-occurrences

Locating right file in the directory with keywords and comparison in R


I am very new to R and have no experience with regular expressions, so any help would be really appreciated.

I am listing the files in a directory and want to find the ones whose name contains the number "22953", then read the newest of those files. The date is also written in each file's name.

Files in the directory:

inv_22953_20190828023258_112140.csv
inv_22953_20190721171018_464152.csv
inv_8979_20190828024558_112140.csv

The problem is that I can't rely on the position of the date within the string because, as you can see, some file names have fewer characters. Maybe a solution would be to locate the date between the 2nd and 3rd underscore.

filepath <- "T:/Pricing/Workstreams/Business Management/EU/01_Operations/02_Carveouts/05_ImplementationTest/"

list.files(filepath)[which.max(suppressWarnings(ymd_hm(substr(list.files(filepath, pattern="_22953"),11,22))))]

Solution

  • library(lubridate)
    
    # First find the files with 22953 inside
    myFiles <- grep("22953", list.files(filepath), value = TRUE)
    
    # Then isolate the date and find which file has the newest (maximum) date:
    
    regex <- "^.*_.*_([0-9]{4})([0-9]{2})([0-9]{2}).*\\.csv$"
    
    myFiles[which(as_date(sub(regex, "\\1-\\2-\\3", myFiles)) == max(as_date(sub(regex, "\\1-\\2-\\3", myFiles))))]
    

    Explanation of the regular expression

    • ^ matches the beginning of a string (says "whatever comes next is the beginning")
    • .* matches any character zero or more times
    • _ matches an underscore
    • [0-9]{4} matches exactly 4 digits (each from 0 to 9)
    • [0-9]{2} matches exactly 2 digits (each from 0 to 9)
    • stuff between parentheses is captured for the replacement string
    • \\1 refers to first group in parentheses, \\2 the second, and \\3 the third
    • $ matches the end of a string, so \\.csv$ says "the string ends in .csv" (the escaped dot \\. matches a literal period)
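
To see the pattern at work without touching the real directory, here is a minimal sketch run on the three file names from the question. It is a variation, not the code above: it uses base R only, and it assumes the field between the 2nd and 3rd underscore is always a 14-digit YYYYMMDDHHMMSS timestamp (inferred from the sample names), so two files from the same day can still be ordered. The variable names and the UTC time zone are illustrative assumptions.

```r
# Sample file names copied from the question (stand-in for list.files(filepath))
files <- c("inv_22953_20190828023258_112140.csv",
           "inv_22953_20190721171018_464152.csv",
           "inv_8979_20190828024558_112140.csv")

# Keep only names whose second underscore-separated field is exactly 22953
hits <- grep("^[^_]*_22953_", files, value = TRUE)

# Capture the 14 digits between the 2nd and 3rd underscore
# (assumption: the timestamp is always YYYYMMDDHHMMSS)
stamp <- sub("^[^_]*_[^_]*_([0-9]{14})_.*\\.csv$", "\\1", hits)

# Parse as date-time; the UTC time zone is an arbitrary choice for comparison
when <- as.POSIXct(stamp, format = "%Y%m%d%H%M%S", tz = "UTC")

# Pick the file with the latest timestamp
newest <- hits[which.max(as.numeric(when))]
print(newest)
# [1] "inv_22953_20190828023258_112140.csv"
```

Using [^_]* instead of .* keeps each field from running across an underscore, which makes it easier to see which part of the name each piece of the pattern matches.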