
How can I list files in a folder based on version number?


I have a folder containing .txt files named like the following:

A_COR_001_I
A_COR_001_II
A_COR_002_I
A_COR_002_II
A_COR_003_I
A_COR_003_II
A_COR_003_III
A_COR_004_I
A_COR_004_II
A_COR_004_III
A_COR_004_IV
...

The roman numeral at the end of each name is the draft number of a document, and the preceding arabic number (e.g. 002) identifies the document. I am trying to extract only the final drafts with a regex pattern passed to list.files(), but each document has an unpredictable number of drafts. I therefore need a way to group the drafts of each document and pick the one with the highest numeral: A_COR_004_IV rather than A_COR_004_III or any earlier draft. Any ideas on how to proceed? Thanks in advance!


Solution

  • Base R has an as.roman() function, which allows simple manipulation of roman numerals.

    So split the file names into groups by what appears before the last underscore (i.e. "A_COR_001" to "A_COR_004"), then in each group pick the element whose roman numeral after the final underscore has the largest value.

    split(files, sub("_[^_]+$", "", files)) |>
        lapply(
            \(l) l[which.max(as.roman(sub(".*_", "", l)))]
        )
    # $A_COR_001
    # [1] "A_COR_001_II"
    
    # $A_COR_002
    # [1] "A_COR_002_II"
    
    # $A_COR_003
    # [1] "A_COR_003_III"
    
    # $A_COR_004
    # [1] "A_COR_004_IV"
    

    I imagine this will not be a problem here, but note that the docs state:

    Only numbers between 1 and 3999 have a unique representation as roman numbers, and hence others result in as.roman(NA).

    Interestingly, this is actually just structure(NA_integer_, class = "roman").
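A quick way to see what this means in practice is to convert a few values by hand. This is only a small sketch: the variable names are made up for illustration, and the last step shows that which.max() skips NA, so a suffix that as.roman() cannot parse would be ignored rather than selected.

```r
# Round trip between roman and integer representations
as.integer(as.roman("IV"))   # 4
as.character(as.roman(14))   # "XIV"

# Out-of-range input becomes NA rather than raising an error
out_of_range <- as.roman(4000)
is.na(out_of_range)          # TRUE

# which.max() skips NA, so an unparseable suffix can never be picked
drafts <- c(2L, NA, 5L)      # hypothetical draft numbers
which.max(drafts)            # 3
```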

    Incidentally, list.files() returns its results in lexicographic order. For roman numerals that matches numeric order as long as no document has more than 8 drafts; from the ninth draft on it breaks, because IX sorts between IV and V. If you are sure you stay below that limit, you can just do lapply(split(files, sub("_[^_]+$", "", files)), tail, 1).
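To illustrate where the shortcut stops working, the sketch below sorts the numerals I to IX as plain strings. It uses sort(method = "radix"), which orders character vectors as in the C locale, so the result does not depend on your system's collation settings; the variable names are illustrative only.

```r
# Roman numerals for drafts 1 to 9, in numeric order
versions <- as.character(as.roman(1:9))

# Plain string sort (C-locale ordering): IX lands between IV and V
sorted <- sort(versions, method = "radix")
sorted
# [1] "I"    "II"   "III"  "IV"   "IX"   "V"    "VI"   "VII"  "VIII"

# tail() would therefore pick draft VIII as the "latest" ...
last_by_name <- tail(sorted, 1)

# ... whereas comparing numeric values picks IX
last_by_value <- versions[which.max(as.integer(as.roman(versions)))]
```

With nine drafts, last_by_name is "VIII" while last_by_value is "IX", which is why the as.roman() approach is the safer default.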

    Data

    files <- c(
        "A_COR_001_I", "A_COR_001_II",
        "A_COR_002_I", "A_COR_002_II",
        "A_COR_003_I", "A_COR_003_II", "A_COR_003_III",
        "A_COR_004_I", "A_COR_004_II", "A_COR_004_III", "A_COR_004_IV"
    )
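The vector above has no file extensions, but the question mentions .txt files, and as.roman() cannot parse a suffix such as "II.txt". Below is a rough sketch of how the same idea might be applied to the output of list.files(). The demo folder and file names are invented for illustration, and it assumes every name follows the <document>_<roman numeral>.txt pattern.

```r
# Hypothetical folder: create a few empty .txt files in a temp directory
demo_dir <- file.path(tempdir(), "drafts_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_names <- c("A_COR_001_I", "A_COR_001_II",
                "A_COR_002_I", "A_COR_002_V", "A_COR_002_IX")
invisible(file.create(file.path(demo_dir, paste0(demo_names, ".txt"))))

# List the files and drop the extension before parsing the names
paths <- list.files(demo_dir, pattern = "\\.txt$")
stems <- tools::file_path_sans_ext(paths)

# Document id = everything before the last underscore;
# draft number = roman numeral after it, converted to integer
doc_id <- sub("_[^_]+$", "", stems)
draft  <- as.integer(as.roman(sub(".*_", "", stems)))

# For each document, keep the file with the highest draft number
final <- vapply(
    split(seq_along(paths), doc_id),
    function(i) paths[i[which.max(draft[i])]],
    character(1)
)
final
```

Here final is a named character vector with one file name per document, which can be passed to file.path(demo_dir, final) to get full paths for reading.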