How to truncate/modify filenames in batches in R?


I have a long list of CSV files. For example, the names look like this:

names <- c("CHE1Q_S1001M1_20220615_025815_AM_Original.csv", "CHE2Q_S1002M1_20220615_030435_AM_Original.csv", "CHE6Q_S1053M2_20220615_033828_PM_Original.csv")

and I wish to batch shorten them to: "CHE1Q_S1001M1.csv", "CHE2Q_S1002M1.csv", "CHE6Q_S1053M2.csv"

I have tried using the sub() function like this:

sub('_.*', '', names)

but it only returns "CHE1Q" "CHE2Q" "CHE6Q".

Or:

sub('_.*\\_', '', names)

which gave "CHE1QOriginal.csv" "CHE2QOriginal.csv" "CHE6QOriginal.csv"

I don't know how to make it ignore the first underscore but remove everything from the second underscore onward.

The best I can get is two steps:

names <- sub('_', '', names)
names <- sub('_.*', '', names)

This gives me the information I need, but without the underscore in the middle: "CHE1QS1001M1" "CHE2QS1002M1" "CHE6QS1053M2"


Solution

  • You can use a regex lookahead to identify strings before the second underscore.

    Explanation:

    • ^ matches the start of the string
    • .+? lazily matches one or more characters (as few as possible)
    • _ matches the first underscore
    • .+? again lazily matches one or more characters
    • (?=_) is a lookahead: it asserts that the next character is an underscore (your second underscore) without consuming it
    • (.+?_.+?(?=_)) puts everything mentioned above into a capture group (note the parentheses () surrounding it)
    • .* matches all remaining characters after the capture group, to the end of the string
    • \\1.csv in the replacement inserts the captured text and appends ".csv"
    sub("^(.+?_.+?(?=_)).*", "\\1.csv", names, perl = TRUE)
    [1] "CHE1Q_S1001M1.csv" "CHE2Q_S1002M1.csv" "CHE6Q_S1053M2.csv"
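
As an alternative sketch, the same result can be had without a lookahead by using a negated character class, which also works with the default regex engine (no perl = TRUE). The second half shows one possible way to apply the new names to files on disk with file.rename(). The directory demo_dir and the files created in it are hypothetical and exist only for this illustration; point old_paths at your own folder instead.

```r
names <- c("CHE1Q_S1001M1_20220615_025815_AM_Original.csv",
           "CHE2Q_S1002M1_20220615_030435_AM_Original.csv",
           "CHE6Q_S1053M2_20220615_033828_PM_Original.csv")

# [^_]+ matches a run of non-underscore characters, so the capture group
# holds the first two underscore-separated fields. The pattern requires a
# second underscore, so names that are already short are left unchanged.
pattern <- "^([^_]+_[^_]+)_.*$"
short <- sub(pattern, "\\1.csv", names)
short

# Hypothetical demo: create empty files in a fresh temporary directory.
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, names))

# Build old and new full paths, changing only the file name part.
old_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
new_paths <- file.path(dirname(old_paths),
                       sub(pattern, "\\1.csv", basename(old_paths)))

# Refuse to rename if two files would end up with the same name.
stopifnot(!anyDuplicated(new_paths))
renamed <- file.rename(old_paths, new_paths)
```

file.rename() returns a logical vector with one entry per file, so all(renamed) is a quick check that every rename succeeded. The duplicate check matters because truncating names can make two different files collide.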