I have a folder containing .txt files named like the following:
A_COR_001_I
A_COR_001_II
A_COR_002_I
A_COR_002_II
A_COR_003_I
A_COR_003_II
A_COR_003_III
A_COR_004_I
A_COR_004_II
A_COR_004_III
A_COR_004_IV
...
The Roman numeral at the end of each name is the draft number of a document, and the document itself is identified by the preceding Arabic number, such as 002. I am trying to extract only the final drafts with a regex pattern in list.files(). The problem is that each document has an unpredictable number of drafts, so I need a way to group the drafts of each document and single out the one with the highest numeral, for example A_COR_004_IV rather than A_COR_004_III or any earlier draft. Any ideas on how to proceed? Thanks in advance!
Base R has an as.roman() function (in the utils package, which is attached by default) that allows "simple manipulation of... roman numerals".
So split the files into lists by filename based on what appears before the last underscore (i.e. "A_COR_001" to "A_COR_004"), then find the element with the max() roman numeral (i.e. the maximum numeric value after the final underscore).
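# Group the names by everything before the last underscore
split(files, sub("_[^_]+$", "", files)) |>
    lapply(
        # Within each group, keep the name whose roman suffix is largest
        \(l) l[which.max(as.roman(sub(".*_", "", l)))]
    )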
# $A_COR_001
# [1] "A_COR_001_II"
# $A_COR_002
# [1] "A_COR_002_II"
# $A_COR_003
# [1] "A_COR_003_III"
# $A_COR_004
# [1] "A_COR_004_IV"
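The question mentions .txt files, while the sample names carry no extension. With the extension still attached, the text after the last underscore would be something like "II.txt", which is not a roman numeral. A minimal sketch of the same idea starting from list.files(), assuming the extension is always .txt; the temporary folder, the `stems` sample and the variable names are hypothetical and only there to make the example self-contained:

```r
# Hypothetical setup: a temporary folder holding empty .txt files
dir <- file.path(tempdir(), "drafts_demo")
dir.create(dir, showWarnings = FALSE)
stems <- c("A_COR_001_I", "A_COR_001_II", "A_COR_002_I",
           "A_COR_003_I", "A_COR_003_II", "A_COR_003_III")
invisible(file.create(file.path(dir, paste0(stems, ".txt"))))

# List the files, then strip the directory and the extension before parsing
paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
found <- tools::file_path_sans_ext(basename(paths))

doc_id <- sub("_[^_]+$", "", found)                 # e.g. "A_COR_003"
draft  <- as.integer(as.roman(sub(".*_", "", found)))  # e.g. 3 for "III"

# For each document, keep the path with the highest draft number
final <- unlist(lapply(
    split(seq_along(paths), doc_id),
    \(i) paths[i[which.max(draft[i])]]
))
basename(final)
```

Working on indices rather than on the names themselves keeps the full paths available, so the result can be passed straight to a reader such as readLines().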
I imagine this will not be a problem here, but note that the docs state:
"Only numbers between 1 and 3999 have a unique representation as roman numbers, and hence others result in as.roman(NA)."
Interestingly, this is actually just structure(NA_integer_, class = "roman").
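To see what that means for the approach above, here is a small sketch; the input values are made up for illustration. A value outside the supported range becomes NA instead of raising an error, and which.max() discards NA values, so such an entry is skipped rather than selected:

```r
# 0 has no roman representation, so it is converted to NA
out_of_range <- as.roman(0)
is.na(out_of_range)

# which.max() ignores the NA and returns the position of the largest valid value
drafts <- as.roman(c(3, 0, 4))
best <- which.max(drafts)
best
```

The upside is that a malformed suffix does not break the selection; the downside is that it is dropped silently, so it may be worth checking anyNA() on the converted numerals first.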
Incidentally, list.files() will return the results in lexicographic order, which is the order you want as long as no document has more than 8 versions (that is, until IX appears). In that case you can just do lapply(split(files, sub("_[^_]+$", "", files)), tail, 1).
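A quick sketch of why the shortcut stops working at nine drafts, using a made-up vector of numerals. The sort here uses method = "radix" (C-locale ordering) as a stand-in for the alphabetical order of list.files(); that substitution is an assumption, although "IX" should sort before "V" under any alphabetical collation:

```r
numerals <- c("I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX")

# Alphabetical order places "IX" right after "IV", ahead of "V"
sorted <- sort(numerals, method = "radix")
sorted

# Taking the last element therefore picks draft 8, not draft 9
last_alphabetical <- tail(sorted, 1)
last_alphabetical   # "VIII"

# Comparing the numeric values picks the true latest draft
latest <- numerals[which.max(as.roman(numerals))]
latest              # "IX"
```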
# Sample data
files <- c(
    "A_COR_001_I", "A_COR_001_II",
    "A_COR_002_I", "A_COR_002_II",
    "A_COR_003_I", "A_COR_003_II", "A_COR_003_III",
    "A_COR_004_I", "A_COR_004_II", "A_COR_004_III", "A_COR_004_IV"
)